
Environment:Togethercomputer Together python Fine Tuning Data Requirements

From Leeroopedia
Knowledge Sources
Domains Fine_Tuning, Data_Validation
Last Updated 2026-02-15 16:00 GMT

Overview

File format, size, and structure requirements for datasets used in Together AI fine-tuning jobs.

Description

Together AI fine-tuning accepts datasets in JSONL or Parquet format. JSONL files must conform to one of four schema types (general, conversation, instruction, preference). Parquet files must contain specific columns. All files are validated client-side by the `check_file` utility before upload, with strict constraints on file size, sample count, and multimodal content.

Usage

Use this environment specification when preparing datasets for the Fine-Tuning workflow. The `check_file()` utility validates files against these requirements before upload.

System Requirements

Category  | Requirement             | Notes
----------|-------------------------|------
File Size | Maximum 50.1 GB         | Per-file limit for fine-tuning uploads
File Size | Minimum > 0 bytes       | Empty files are rejected
Samples   | Minimum 1 sample        | At least 1 valid example required
Disk      | Sufficient for dataset  | Plus ~2x headroom for Parquet conversion if applicable
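The minimum-sample requirement can be pre-checked locally before upload. The helper below is an illustrative sketch (not the SDK's actual implementation): it counts non-blank lines that parse as JSON, which is the unit the sample minimum is measured in for JSONL files.

```python
import json
import tempfile

def count_valid_samples(path):
    """Count non-blank lines that parse as JSON (the fine-tuning minimum is 1).
    Illustrative helper only, not the SDK's actual check."""
    n = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                json.loads(line)  # a malformed line raises an error here
                n += 1
    return n

# Demo: a two-line JSONL file in the "general" format
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "first sample"}\n{"text": "second sample"}\n')
    path = f.name

print(count_valid_samples(path))  # 2
```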

Dependencies

Python Packages

  • `together` SDK (core) — For `check_file` validation
  • `pyarrow` >=10.0.1 — Required only for Parquet format files; install via `pip install together[pyarrow]`

Credentials

No credentials required for local file validation. `TOGETHER_API_KEY` is required for the subsequent upload step.
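The split between validation and upload can be made explicit in a preflight step. The helper below is a hypothetical convenience, not part of the SDK: it only reports whether `TOGETHER_API_KEY` is available for the upload step.

```python
import os

def upload_ready(env=None):
    """Return True when TOGETHER_API_KEY is set in the environment.
    Local check_file validation does not need the key; only the
    subsequent upload step does. (Hypothetical helper, not SDK API.)"""
    env = os.environ if env is None else env
    return bool(env.get("TOGETHER_API_KEY"))
```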

Quick Install

# Core SDK
pip install together

# With Parquet support
pip install "together[pyarrow]"

Code Evidence

File size validation from `src/together/utils/files.py:83-90`:

file_size = os.stat(file).st_size

if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
    report_dict["message"] = (
        f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB. "
        f"Found file with size of {round(file_size / NUM_BYTES_IN_GB, 3)} GB."
    )
    report_dict["is_check_passed"] = False
elif file_size == 0:
    report_dict["message"] = "File is empty"
    report_dict["is_check_passed"] = False
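The snippet above can be run stand-alone as follows. The constant values are assumptions mirroring the SDK (`MAX_FILE_SIZE_GB = 50.1` appears in the error message above; `NUM_BYTES_IN_GB = 2**30` is assumed here), so treat this as a sketch of the same logic rather than the SDK's code.

```python
import os
import tempfile

# Assumed values mirroring the SDK constants referenced above
MAX_FILE_SIZE_GB = 50.1
NUM_BYTES_IN_GB = 2**30

def check_size(path):
    """Sketch of the client-side size check: reject oversized and empty files."""
    report = {"is_check_passed": True, "message": "ok"}
    file_size = os.stat(path).st_size
    if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
        report["message"] = f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB."
        report["is_check_passed"] = False
    elif file_size == 0:
        report["message"] = "File is empty"
        report["is_check_passed"] = False
    return report

# Demo: a small non-empty file passes
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"text": "hello"}\n')
    path = f.name

print(check_size(path)["is_check_passed"])  # True
```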

JSONL format definitions from `src/together/constants.py:54-74`:

class DatasetFormat(enum.Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"
    INSTRUCTION = "instruction"
    PREFERENCE_OPENAI = "preference_openai"

JSONL_REQUIRED_COLUMNS_MAP = {
    DatasetFormat.GENERAL: ["text"],
    DatasetFormat.CONVERSATION: ["messages"],
    DatasetFormat.INSTRUCTION: ["prompt", "completion"],
    DatasetFormat.PREFERENCE_OPENAI: [
        "input", "preferred_output", "non_preferred_output",
    ],
}
REQUIRED_COLUMNS_MESSAGE = ["role", "content"]
POSSIBLE_ROLES_CONVERSATION = ["system", "user", "assistant"]
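The four schemas can be illustrated with one sample line each. Only the top-level column names below come from the constants above; the exact value shapes for the `preference_openai` rows (a messages-wrapped `input`, list-valued outputs) are assumptions modeled on the OpenAI preference format, so verify them against current Together AI documentation.

```python
import json

# One illustrative record per JSONL schema type; each training file uses
# a single schema, one JSON object per line.
samples = {
    "general": {"text": "The quick brown fox."},
    "conversation": {"messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]},
    "instruction": {"prompt": "Translate to French: cat", "completion": "chat"},
    "preference_openai": {  # value shapes here are assumptions, see note above
        "input": {"messages": [{"role": "user", "content": "Hi"}]},
        "preferred_output": [{"role": "assistant", "content": "Hello!"}],
        "non_preferred_output": [{"role": "assistant", "content": "Hey."}],
    },
}

# Every line in a JSONL file must round-trip as valid JSON
for name, example in samples.items():
    line = json.dumps(example)
    assert json.loads(line) == example
    print(name, "->", sorted(example))
```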

Parquet column requirements from `src/together/constants.py:51`:

PARQUET_EXPECTED_COLUMNS = ["input_ids", "attention_mask", "labels"]

Multimodal limits from `src/together/constants.py:45-48`:

MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB
MAX_BASE64_IMAGE_LENGTH = len("data:image/jpeg;base64,") + 4 * MAX_IMAGE_BYTES // 3
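These limits can be applied in a local pre-check. The function below is a hypothetical validator built only from the constants above (the SDK may check more than prefix and length); the payload bytes in the demo are a stand-in, not a real image.

```python
import base64

# Constants from the SDK source quoted above
MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB
MAX_BASE64_IMAGE_LENGTH = len("data:image/jpeg;base64,") + 4 * MAX_IMAGE_BYTES // 3

def image_url_ok(url):
    """Hypothetical pre-check: data-URL prefix with an allowed image format,
    and total length within the base64 bound derived from MAX_IMAGE_BYTES."""
    allowed = (
        "data:image/jpeg;base64,",
        "data:image/png;base64,",
        "data:image/webp;base64,",
    )
    return url.startswith(allowed) and len(url) <= MAX_BASE64_IMAGE_LENGTH

raw = b"\xff\xd8\xff\xe0fake-jpeg-bytes"  # stand-in payload, not a real image
url = "data:image/jpeg;base64," + base64.b64encode(raw).decode()
print(image_url_ok(url))  # True
```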

Common Errors

Error Message | Cause | Solution
--------------|-------|---------
`Maximum supported file size is 50.1 GB` | File exceeds size limit | Split dataset into smaller files
`File is empty` | Zero-byte file | Ensure file has content
`Processing {file} resulted in only {n} samples. Our minimum is 1 samples.` | Too few valid samples | Check JSONL format; each line must be valid JSON
`Messages in the conversation must be either all in multimodal or all in text-only format` | Mixed modality in single example | Use either all text or all multimodal messages per conversation
`The messages must contain at most 10 images` | Too many images per example | Reduce images to <= 10 per conversation
`The url field must be either a JPEG, PNG or WEBP base64-encoded image` | Invalid image format | Use `data:image/{format};base64,{data}` with JPEG, PNG, or WEBP

Compatibility Notes

  • JSONL Formats: Four distinct formats with different required columns. The SDK auto-detects format from column names.
  • Conversation Messages: Each message must have `role` (system/user/assistant) and `content` fields.
  • Parquet Files: Must contain exactly `input_ids`, `attention_mask`, and `labels` columns. Requires `pyarrow` to be installed.
  • Multimodal: Images must be base64-encoded; max 10 images per example, max 10MB per image. Supported formats: JPEG, PNG, WEBP.
  • Preference (DPO): The `preference_openai` format requires `input`, `preferred_output`, and `non_preferred_output` columns.
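The auto-detection mentioned in the first note can be sketched from the required-columns map quoted earlier. The detection logic below is illustrative only; the SDK's actual resolution order and tie-breaking may differ.

```python
# Required-columns map taken from the constants shown in Code Evidence
JSONL_REQUIRED_COLUMNS_MAP = {
    "general": ["text"],
    "conversation": ["messages"],
    "instruction": ["prompt", "completion"],
    "preference_openai": ["input", "preferred_output", "non_preferred_output"],
}

def detect_format(sample):
    """Return the first format whose required columns all appear in the
    sample dict, else None. (Sketch; the SDK's detection may differ.)"""
    for fmt, cols in JSONL_REQUIRED_COLUMNS_MAP.items():
        if all(c in sample for c in cols):
            return fmt
    return None

print(detect_format({"prompt": "Q?", "completion": "A."}))  # instruction
```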
