
Environment:Togethercomputer Together python Fine Tuning Data Requirements

From Leeroopedia
Knowledge Sources
Domains Fine_Tuning, Data_Validation
Last Updated 2026-02-15 16:00 GMT

Overview

File format, size, and structure requirements for datasets used in Together AI fine-tuning jobs.

Description

Together AI fine-tuning accepts datasets in JSONL or Parquet format. JSONL files must conform to one of four schema types (general, conversation, instruction, preference). Parquet files must contain specific columns. All files are validated client-side by the `check_file` utility before upload, with strict constraints on file size, sample count, and multimodal content.

Usage

Use this environment specification when preparing datasets for the Fine-Tuning workflow. The `check_file()` utility validates files against these requirements before upload.

System Requirements

Category  | Requirement             | Notes
----------|-------------------------|------
File Size | Maximum 50.1 GB         | Per-file limit for fine-tuning uploads
File Size | Minimum > 0 bytes       | Empty files are rejected
Samples   | Minimum 1 sample        | At least 1 valid example required
Disk      | Sufficient for dataset  | Plus ~2x headroom for Parquet conversion if applicable
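The minimum-sample requirement can be pre-checked locally before upload. The helper below is an illustrative sketch (not the SDK's actual implementation): it counts non-blank lines that parse as JSON, which is the unit the sample minimum is measured in for JSONL files.

```python
import json
import tempfile

def count_valid_samples(path):
    """Count non-blank lines that parse as JSON (the fine-tuning minimum is 1).
    Illustrative helper only, not the SDK's actual check."""
    n = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                json.loads(line)  # a malformed line raises an error here
                n += 1
    return n

# Demo: a two-line JSONL file in the "general" format
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "first sample"}\n{"text": "second sample"}\n')
    path = f.name

print(count_valid_samples(path))  # 2
```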

Dependencies

Python Packages

  • `together` SDK (core) — For `check_file` validation
  • `pyarrow` >=10.0.1 — Required only for Parquet format files; install via `pip install together[pyarrow]`

Credentials

No credentials required for local file validation. `TOGETHER_API_KEY` is required for the subsequent upload step.
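The split between validation and upload can be made explicit in a preflight step. The helper below is a hypothetical convenience, not part of the SDK: it only reports whether `TOGETHER_API_KEY` is available for the upload step.

```python
import os

def upload_ready(env=None):
    """Return True when TOGETHER_API_KEY is set in the environment.
    Local check_file validation does not need the key; only the
    subsequent upload step does. (Hypothetical helper, not SDK API.)"""
    env = os.environ if env is None else env
    return bool(env.get("TOGETHER_API_KEY"))
```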

Quick Install

# Core SDK
pip install together

# With Parquet support
pip install "together[pyarrow]"

Code Evidence

File size validation from `src/together/utils/files.py:83-90`:

file_size = os.stat(file).st_size

if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
    report_dict["message"] = (
        f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB. "
        f"Found file with size of {round(file_size / NUM_BYTES_IN_GB, 3)} GB."
    )
    report_dict["is_check_passed"] = False
elif file_size == 0:
    report_dict["message"] = "File is empty"
    report_dict["is_check_passed"] = False
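The snippet above can be run stand-alone as follows. The constant values are assumptions mirroring the SDK (`MAX_FILE_SIZE_GB = 50.1` appears in the error message above; `NUM_BYTES_IN_GB = 2**30` is assumed here), so treat this as a sketch of the same logic rather than the SDK's code.

```python
import os
import tempfile

# Assumed values mirroring the SDK constants referenced above
MAX_FILE_SIZE_GB = 50.1
NUM_BYTES_IN_GB = 2**30

def check_size(path):
    """Sketch of the client-side size check: reject oversized and empty files."""
    report = {"is_check_passed": True, "message": "ok"}
    file_size = os.stat(path).st_size
    if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
        report["message"] = f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB."
        report["is_check_passed"] = False
    elif file_size == 0:
        report["message"] = "File is empty"
        report["is_check_passed"] = False
    return report

# Demo: a small non-empty file passes
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "hello"}\n')
    path = f.name

print(check_size(path)["is_check_passed"])  # True
```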

JSONL format definitions from `src/together/constants.py:54-74`:

class DatasetFormat(enum.Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"
    INSTRUCTION = "instruction"
    PREFERENCE_OPENAI = "preference_openai"

JSONL_REQUIRED_COLUMNS_MAP = {
    DatasetFormat.GENERAL: ["text"],
    DatasetFormat.CONVERSATION: ["messages"],
    DatasetFormat.INSTRUCTION: ["prompt", "completion"],
    DatasetFormat.PREFERENCE_OPENAI: [
        "input", "preferred_output", "non_preferred_output",
    ],
}
REQUIRED_COLUMNS_MESSAGE = ["role", "content"]
POSSIBLE_ROLES_CONVERSATION = ["system", "user", "assistant"]
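The four schemas can be illustrated with one sample line each. Only the top-level column names below come from the constants above; the exact value shapes for the `preference_openai` rows (a messages-wrapped `input`, list-valued outputs) are assumptions modeled on the OpenAI preference format, so verify them against current Together AI documentation.

```python
import json

# One illustrative record per JSONL schema type; each training file uses
# a single schema, one JSON object per line.
samples = {
    "general": {"text": "The quick brown fox."},
    "conversation": {"messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]},
    "instruction": {"prompt": "Translate to French: cat", "completion": "chat"},
    "preference_openai": {  # value shapes here are assumptions, see note above
        "input": {"messages": [{"role": "user", "content": "Hi"}]},
        "preferred_output": [{"role": "assistant", "content": "Hello!"}],
        "non_preferred_output": [{"role": "assistant", "content": "Hey."}],
    },
}

# Every line in a JSONL file must round-trip as valid JSON
for name, example in samples.items():
    line = json.dumps(example)
    assert json.loads(line) == example
    print(name, "->", sorted(example))
```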

Parquet column requirements from `src/together/constants.py:51`:

PARQUET_EXPECTED_COLUMNS = ["input_ids", "attention_mask", "labels"]

Multimodal limits from `src/together/constants.py:45-48`:

MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB
MAX_BASE64_IMAGE_LENGTH = len("data:image/jpeg;base64,") + 4 * MAX_IMAGE_BYTES // 3
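These limits can be applied in a local pre-check. The function below is a hypothetical validator built only from the constants above (the SDK may check more than prefix and length); the payload bytes in the demo are a stand-in, not a real image.

```python
import base64

# Constants from the SDK source quoted above
MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB
MAX_BASE64_IMAGE_LENGTH = len("data:image/jpeg;base64,") + 4 * MAX_IMAGE_BYTES // 3

def image_url_ok(url):
    """Hypothetical pre-check: data-URL prefix with an allowed image format,
    and total length within the base64 bound derived from MAX_IMAGE_BYTES."""
    allowed = (
        "data:image/jpeg;base64,",
        "data:image/png;base64,",
        "data:image/webp;base64,",
    )
    return url.startswith(allowed) and len(url) <= MAX_BASE64_IMAGE_LENGTH

raw = b"\xff\xd8\xff\xe0fake-jpeg-bytes"  # stand-in payload, not a real image
url = "data:image/jpeg;base64," + base64.b64encode(raw).decode()
print(image_url_ok(url))  # True
```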

Common Errors

Error Message | Cause | Solution
--------------|-------|---------
`Maximum supported file size is 50.1 GB` | File exceeds size limit | Split dataset into smaller files
`File is empty` | Zero-byte file | Ensure file has content
`Processing {file} resulted in only {n} samples. Our minimum is 1 samples.` | Too few valid samples | Check JSONL format; each line must be valid JSON
`Messages in the conversation must be either all in multimodal or all in text-only format` | Mixed modality in single example | Use either all text or all multimodal messages per conversation
`The messages must contain at most 10 images` | Too many images per example | Reduce images to <= 10 per conversation
`The url field must be either a JPEG, PNG or WEBP base64-encoded image` | Invalid image format | Use `data:image/{format};base64,{data}` with JPEG, PNG, or WEBP

Compatibility Notes

  • JSONL Formats: Four distinct formats with different required columns. The SDK auto-detects format from column names.
  • Conversation Messages: Each message must have `role` (system/user/assistant) and `content` fields.
  • Parquet Files: Must contain exactly `input_ids`, `attention_mask`, and `labels` columns. Requires `pyarrow` to be installed.
  • Multimodal: Images must be base64-encoded; max 10 images per example, max 10MB per image. Supported formats: JPEG, PNG, WEBP.
  • Preference (DPO): The `preference_openai` format requires `input`, `preferred_output`, and `non_preferred_output` columns.
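The auto-detection mentioned in the first note can be sketched from the required-columns map quoted earlier. The detection logic below is illustrative only; the SDK's actual resolution order and tie-breaking may differ.

```python
# Required-columns map taken from the constants shown in Code Evidence
JSONL_REQUIRED_COLUMNS_MAP = {
    "general": ["text"],
    "conversation": ["messages"],
    "instruction": ["prompt", "completion"],
    "preference_openai": ["input", "preferred_output", "non_preferred_output"],
}

def detect_format(sample):
    """Return the first format whose required columns all appear in the
    sample dict, else None. (Sketch; the SDK's detection may differ.)"""
    for fmt, cols in JSONL_REQUIRED_COLUMNS_MAP.items():
        if all(c in sample for c in cols):
            return fmt
    return None

print(detect_format({"prompt": "Q?", "completion": "A."}))  # instruction
```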
