Environment: Together Python SDK Fine-Tuning Data Requirements
| Knowledge Sources | |
|---|---|
| Domains | Fine_Tuning, Data_Validation |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
File format, size, and structure requirements for datasets used in Together AI fine-tuning jobs.
Description
Together AI fine-tuning accepts datasets in JSONL or Parquet format. JSONL files must conform to one of four schema types (general, conversation, instruction, or preference_openai), while Parquet files must contain a fixed set of tokenized columns. All files are validated client-side by the `check_file` utility before upload, with strict constraints on file size, sample count, and multimodal content.
Usage
Use this environment specification when preparing datasets for the Fine-Tuning workflow. The `check_file()` utility validates files against these requirements before upload.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| File Size | Maximum 50.1 GB | Per-file limit for fine-tuning uploads |
| File Size | Minimum > 0 bytes | Empty files are rejected |
| Samples | Minimum 1 sample | At least 1 valid example required |
| Disk | Sufficient for dataset | Plus ~2x for Parquet conversion if applicable |
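The size constraints above can be pre-checked locally before invoking the SDK. The sketch below mirrors the validator's logic with the standard library only; the constant names come from the code evidence later in this document, but the value of `NUM_BYTES_IN_GB` (binary gigabytes, `2**30`) is an assumption.

```python
import os

# Limits as named in src/together/constants.py; the byte value of a "GB"
# here (2**30) is an assumption, not confirmed from the source.
MAX_FILE_SIZE_GB = 50.1
NUM_BYTES_IN_GB = 2**30


def precheck_size(path: str) -> tuple[bool, str]:
    """Mirror the client-side size checks: non-empty and under the per-file cap."""
    size = os.stat(path).st_size
    if size == 0:
        return False, "File is empty"
    if size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
        return False, f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB."
    return True, "ok"
```

This is a convenience pre-flight only; `check_file` remains the authoritative validator.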
Dependencies
Python Packages
- `together` SDK (core) — For `check_file` validation
- `pyarrow` >=10.0.1 — Required only for Parquet format files; install via `pip install together[pyarrow]`
Credentials
No credentials required for local file validation. `TOGETHER_API_KEY` is required for the subsequent upload step.
Quick Install
```shell
# Core SDK
pip install together

# With Parquet support
pip install "together[pyarrow]"
```
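Since `pyarrow` is an optional extra, a script that handles Parquet datasets may want to fail fast with a clear message when it is missing. A minimal guard, using only the standard library:

```python
import importlib.util


def parquet_support_available() -> bool:
    """Parquet validation needs pyarrow (installed via together[pyarrow])."""
    return importlib.util.find_spec("pyarrow") is not None
```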
Code Evidence
File size validation from `src/together/utils/files.py:83-90`:
```python
file_size = os.stat(file).st_size
if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
    report_dict["message"] = (
        f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB. "
        f"Found file with size of {round(file_size / NUM_BYTES_IN_GB, 3)} GB."
    )
    report_dict["is_check_passed"] = False
elif file_size == 0:
    report_dict["message"] = "File is empty"
    report_dict["is_check_passed"] = False
```
JSONL format definitions from `src/together/constants.py:54-74`:
```python
class DatasetFormat(enum.Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"
    INSTRUCTION = "instruction"
    PREFERENCE_OPENAI = "preference_openai"


JSONL_REQUIRED_COLUMNS_MAP = {
    DatasetFormat.GENERAL: ["text"],
    DatasetFormat.CONVERSATION: ["messages"],
    DatasetFormat.INSTRUCTION: ["prompt", "completion"],
    DatasetFormat.PREFERENCE_OPENAI: [
        "input", "preferred_output", "non_preferred_output",
    ],
}

REQUIRED_COLUMNS_MESSAGE = ["role", "content"]
POSSIBLE_ROLES_CONVERSATION = ["system", "user", "assistant"]
```
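To make the four schemas concrete, here is one illustrative record per format, written out as JSONL. The top-level keys match `JSONL_REQUIRED_COLUMNS_MAP` above; the nested shape of the `preference_openai` values (a chat-style `input` and message-list outputs) is an assumption for illustration, not confirmed by the source.

```python
import json

# One illustrative record per schema; top-level keys match the required columns.
examples = {
    "general": {"text": "The quick brown fox jumps over the lazy dog."},
    "conversation": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What format is this?"},
            {"role": "assistant", "content": "A conversation-format JSONL record."},
        ]
    },
    "instruction": {"prompt": "Translate to French: hello", "completion": "bonjour"},
    # Nested value shapes below are assumed for illustration only.
    "preference_openai": {
        "input": {"messages": [{"role": "user", "content": "Summarize this."}]},
        "preferred_output": [{"role": "assistant", "content": "A concise summary."}],
        "non_preferred_output": [{"role": "assistant", "content": "Off-topic text."}],
    },
}

# JSONL: exactly one JSON object per line.
with open("train.jsonl", "w") as f:
    for record in examples.values():
        f.write(json.dumps(record) + "\n")
```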
Parquet column requirements from `src/together/constants.py:51`:
```python
PARQUET_EXPECTED_COLUMNS = ["input_ids", "attention_mask", "labels"]
```
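A Parquet file must carry exactly these three columns. The helper below is a pure-Python sketch of that column check, not the SDK's actual pyarrow-based implementation:

```python
PARQUET_EXPECTED_COLUMNS = ["input_ids", "attention_mask", "labels"]


def check_parquet_columns(columns):
    """Return a description of any column mismatch, or None if columns match."""
    missing = [c for c in PARQUET_EXPECTED_COLUMNS if c not in columns]
    extra = [c for c in columns if c not in PARQUET_EXPECTED_COLUMNS]
    if missing or extra:
        return f"missing={missing} unexpected={extra}"
    return None
```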
Multimodal limits from `src/together/constants.py:45-48`:
```python
MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB
MAX_BASE64_IMAGE_LENGTH = len("data:image/jpeg;base64,") + 4 * MAX_IMAGE_BYTES // 3
```
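These limits imply a simple shape for valid image URLs: a `data:image/...;base64,` prefix for one of the allowed formats, followed by base64 data that decodes to at most 10MB. A hedged stdlib sketch of that check (the SDK's real multimodal validator may differ in detail):

```python
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB decoded
ALLOWED_PREFIXES = (
    "data:image/jpeg;base64,",
    "data:image/png;base64,",
    "data:image/webp;base64,",
)


def check_image_url(url: str) -> bool:
    """Accept only JPEG/PNG/WEBP base64 data URLs under the decoded-size cap."""
    prefix = next((p for p in ALLOWED_PREFIXES if url.startswith(p)), None)
    if prefix is None:
        return False
    raw = base64.b64decode(url[len(prefix):], validate=True)
    return len(raw) <= MAX_IMAGE_BYTES
```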
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Maximum supported file size is 50.1 GB` | File exceeds size limit | Split dataset into smaller files |
| `File is empty` | Zero-byte file | Ensure file has content |
| `Processing {file} resulted in only {n} samples. Our minimum is 1 samples.` | Too few valid samples | Check JSONL format; each line must be valid JSON |
| `Messages in the conversation must be either all in multimodal or all in text-only format` | Mixed modality in single example | Use either all text or all multimodal messages per conversation |
| `The messages must contain at most 10 images` | Too many images per example | Reduce images to <= 10 per conversation |
| `The url field must be either a JPEG, PNG or WEBP base64-encoded image` | Invalid image format | Use `data:image/{format};base64,{data}` with JPEG, PNG, or WEBP |
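For the oversized-file error above, the suggested fix is to split the dataset. A minimal sharding sketch that keeps line boundaries intact (it counts characters rather than encoded bytes, which is only approximate for non-ASCII content):

```python
def split_jsonl(path: str, max_bytes: int, out_prefix: str) -> list[str]:
    """Split a JSONL file into shards no larger than max_bytes, keeping whole lines."""
    shards, buf, size, idx = [], [], 0, 0

    def flush():
        nonlocal buf, size, idx
        if not buf:
            return
        name = f"{out_prefix}-{idx:03d}.jsonl"
        with open(name, "w") as f:
            f.writelines(buf)
        shards.append(name)
        buf, size, idx = [], 0, idx + 1

    with open(path) as f:
        for line in f:
            if size + len(line) > max_bytes:
                flush()
            buf.append(line)
            size += len(line)
    flush()
    return shards
```

Each resulting shard should then be validated and uploaded independently.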
Compatibility Notes
- JSONL Formats: Four distinct formats with different required columns. The SDK auto-detects format from column names.
- Conversation Messages: Each message must have `role` (system/user/assistant) and `content` fields.
- Parquet Files: Must contain exactly `input_ids`, `attention_mask`, and `labels` columns. Requires `pyarrow` to be installed.
- Multimodal: Images must be base64-encoded; max 10 images per example, max 10MB per image. Supported formats: JPEG, PNG, WEBP.
- Preference (DPO): The `preference_openai` format requires `input`, `preferred_output`, and `non_preferred_output` columns.
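The conversation-message rules above (required `role` and `content` fields, roles restricted to system/user/assistant) can be sketched as a small standalone checker using the constants shown in the code evidence:

```python
REQUIRED_COLUMNS_MESSAGE = ["role", "content"]
POSSIBLE_ROLES_CONVERSATION = ["system", "user", "assistant"]


def validate_messages(messages):
    """Report messages missing required fields or using a disallowed role."""
    errors = []
    for i, msg in enumerate(messages):
        for key in REQUIRED_COLUMNS_MESSAGE:
            if key not in msg:
                errors.append(f"message {i}: missing '{key}'")
        if msg.get("role") not in POSSIBLE_ROLES_CONVERSATION:
            errors.append(f"message {i}: invalid role {msg.get('role')!r}")
    return errors
```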