
Principle:Togethercomputer Together python Dataset Preparation

From Leeroopedia
Principle Name: Dataset_Preparation
Overview: Pattern for preparing training datasets in formats compatible with Together AI fine-tuning.
Domain: MLOps, Fine_Tuning, Data_Preparation
Repository: togethercomputer/together-python
Last Updated: 2026-02-15 16:00 GMT

Description

Dataset preparation defines the required file formats and schemas for fine-tuning data on Together AI. The SDK supports three file formats -- JSONL, Parquet, and CSV -- each serving different purposes and carrying specific structural requirements. Properly formatting training data is a prerequisite to uploading and launching fine-tuning jobs.

Together AI recognizes four distinct JSONL dataset formats, each distinguished by its top-level JSON keys:

  • Conversational format -- Uses a messages key containing a list of role/content message dictionaries. Roles must alternate between user and assistant (with an optional leading system message). Each message requires role and content fields, and at least one assistant message must be present.
  • Instruction format -- Uses prompt and completion keys. The prompt represents the input instruction and the completion is the desired output.
  • General text format -- Uses a single text key containing the raw training text. Suitable for continued pretraining on unstructured text corpora.
  • DPO preference format -- Uses input, preferred_output, and non_preferred_output keys. The input field contains a messages list (without a trailing assistant message), while the output fields each contain a single-element list with an assistant message. This format is used for Direct Preference Optimization training.
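To make the schemas concrete, here is a sketch of one record per format, with a conversational file written out as JSONL. The field names follow the list above; the record contents are illustrative, and a real training file must use a single format throughout.

```python
import json

# One illustrative record per JSONL format (a real file uses one format only).
conversational = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
}
instruction = {"prompt": "Translate to French: hello", "completion": "bonjour"}
general_text = {"text": "Raw text for continued pretraining."}
dpo = {
    # The input messages list carries no trailing assistant message;
    # each output field is a single-element list with an assistant message.
    "input": {"messages": [{"role": "user", "content": "Tell me a joke."}]},
    "preferred_output": [{"role": "assistant", "content": "A well-formed joke."}],
    "non_preferred_output": [{"role": "assistant", "content": "No."}],
}

# JSONL: one UTF-8-encoded JSON object per line.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in [conversational, conversational]:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```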

Parquet format is used exclusively for pre-tokenized data. It requires an input_ids column and optionally supports attention_mask and labels columns. Pre-tokenized data bypasses the server-side tokenizer and is useful for advanced workflows such as sequence packing.

CSV format is supported only for evaluation files; it is not accepted as fine-tuning training data.

All formats enforce a minimum sample count (currently 1 sample), a maximum file size (50.1 GB), and UTF-8 encoding for text-based files.
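These limits can be sanity-checked locally before upload. The helper below is an illustrative re-implementation of the documented constraints, not the SDK's own validator:

```python
import json
import os

def precheck_jsonl(path, max_bytes=50_100_000_000):
    """Mirror the documented limits: size cap, UTF-8 encoding, >= 1 sample."""
    if os.path.getsize(path) > max_bytes:
        raise ValueError("file exceeds the 50.1 GB size limit")
    samples = 0
    # open() raises UnicodeDecodeError on non-UTF-8 content.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # each non-empty line must be valid JSON
                samples += 1
    if samples < 1:
        raise ValueError("at least one training sample is required")
    return samples
```

For example, a one-line file such as `{"text": "hi"}` passes the check and returns a sample count of 1.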

Usage

Use this principle before uploading training data to Together AI. The dataset must be prepared in one of the supported formats before the file can pass validation and be accepted for fine-tuning. The choice of format depends on the training method:

  • For SFT (Supervised Fine-Tuning): Use conversational, instruction, or general text formats.
  • For DPO (Direct Preference Optimization): Use the DPO preference format.
  • For pre-tokenized workflows: Use Parquet format with the tokenize_data.py example script as a reference.
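Once a format is chosen, the prepared file can be validated and uploaded. The sketch below assumes the together-python SDK exposes a `check_file` utility and a `files.upload` method with these parameters; exact signatures and report keys may differ across SDK versions, so consult the current SDK documentation.

```python
def upload_training_file(path: str):
    """Validate a prepared dataset locally, then upload it for fine-tuning.

    Sketch only: assumes together.utils.check_file and Together().files.upload
    as described above; requires the together package and TOGETHER_API_KEY.
    """
    from together import Together
    from together.utils import check_file  # local format validation

    report = check_file(path)
    if not report.get("is_check_passed", False):
        raise ValueError(f"dataset failed validation: {report}")

    client = Together()  # reads TOGETHER_API_KEY from the environment
    return client.files.upload(file=path, purpose="fine-tune")
```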

Multimodal datasets (containing images) are supported within the conversational format by using the OpenAI-style content list structure with text and image_url content items. Images must be base64-encoded (JPEG, PNG, or WEBP), limited to 10 MB each, and at most 10 images per example.
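A multimodal training example can be assembled as below. The helper is illustrative; the content-list field names follow the OpenAI-style convention described above, and the data-URL prefix assumes a PNG image.

```python
import base64

def image_message(text: str, image_path: str) -> dict:
    """Build an OpenAI-style multimodal user message with a base64 data URL."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            # Image must be base64-encoded JPEG/PNG/WEBP, at most 10 MB;
            # an example may carry at most 10 such images.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ],
    }
```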

Theoretical Basis

Fine-tuning requires structured training examples that the training pipeline can parse into input-output pairs for loss computation. The format determines how input masking (train_on_inputs) is applied:

  • Conversational format: {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  • Instruction format: {"prompt": "...", "completion": "..."}
  • General text format: {"text": "..."}
  • DPO preference format: {"input": {"messages": [...]}, "preferred_output": [{"role": "assistant", "content": "..."}], "non_preferred_output": [{"role": "assistant", "content": "..."}]}

The format-to-column mapping is defined in src/together/constants.py via the JSONL_REQUIRED_COLUMNS_MAP dictionary, which maps each DatasetFormat enum value to its required columns. Extra columns beyond the required set are rejected to prevent silent data issues.

The conversational format additionally validates role sequences (must alternate user/assistant after an optional system message), message weights (must be integer 0 or 1), and content types (string or multimodal content list).
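The role-sequence rules can be sketched as a small checker. This is an illustrative re-implementation of the constraints described above, not the SDK's validator:

```python
def validate_roles(messages):
    """Enforce: optional leading system message, then strictly alternating
    user/assistant roles, with at least one assistant message present."""
    roles = [m["role"] for m in messages]
    if roles and roles[0] == "system":
        roles = roles[1:]  # a single leading system message is allowed
    if "assistant" not in roles:
        raise ValueError("at least one assistant message is required")
    for i, role in enumerate(roles):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"position {i}: expected {expected}, got {role}")
    return True
```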
