Principle: togethercomputer/together-python Dataset Preparation
| Attribute | Value |
|---|---|
| Principle Name | Dataset_Preparation |
| Overview | Pattern for preparing training datasets in formats compatible with Together AI fine-tuning. |
| Domain | MLOps, Fine_Tuning, Data_Preparation |
| Repository | togethercomputer/together-python |
| Last Updated | 2026-02-15 16:00 GMT |
Description
Dataset preparation defines the required file formats and schemas for fine-tuning data on Together AI. The SDK supports three file formats -- JSONL, Parquet, and CSV -- each serving different purposes and carrying specific structural requirements. Properly formatting training data is a prerequisite to uploading and launching fine-tuning jobs.
Together AI recognizes four distinct JSONL dataset formats, each distinguished by its top-level JSON keys:
- Conversational format -- Uses a `messages` key containing a list of role/content message dictionaries. Roles must alternate between `user` and `assistant` (with an optional leading `system` message). Each message requires `role` and `content` fields, and at least one `assistant` message must be present.
- Instruction format -- Uses `prompt` and `completion` keys. The prompt represents the input instruction and the completion is the desired output.
- General text format -- Uses a single `text` key containing the raw training text. Suitable for continued pretraining on unstructured text corpora.
- DPO preference format -- Uses `input`, `preferred_output`, and `non_preferred_output` keys. The `input` field contains a `messages` list (without a trailing assistant message), while the output fields each contain a single-element list holding an assistant message. This format is used for Direct Preference Optimization training.
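As a quick illustration, each format is one JSON object per line of the file. The example contents below are invented; only the top-level keys and nesting follow the schemas described above:

```python
import json

# One invented example per JSONL format; keys match the schemas above.
examples = {
    "conversational": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ]
    },
    "instruction": {"prompt": "Translate to French: hello", "completion": "bonjour"},
    "general_text": {"text": "Raw text for continued pretraining."},
    "dpo": {
        "input": {"messages": [{"role": "user", "content": "Summarize this."}]},
        "preferred_output": [{"role": "assistant", "content": "A concise summary."}],
        "non_preferred_output": [{"role": "assistant", "content": "An off-topic reply."}],
    },
}

def to_jsonl(records):
    """Serialize records as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

conversational_jsonl = to_jsonl([examples["conversational"]])
```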
Parquet format is used exclusively for pre-tokenized data. It requires an input_ids column and optionally supports attention_mask and labels columns. Pre-tokenized data bypasses the server-side tokenizer and is useful for advanced workflows such as sequence packing.
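A minimal pure-Python sketch of assembling one pre-tokenized row is shown below. The token ids are invented (a real workflow would produce them with a tokenizer), the `-100` label-masking convention is an assumption borrowed from common Hugging Face-style loss masking rather than a documented Together requirement, and writing the rows out to Parquet (e.g. with pyarrow) is omitted to keep the sketch self-contained:

```python
def build_pretokenized_row(prompt_ids, completion_ids, train_on_inputs=False):
    """Concatenate prompt and completion token ids into the three columns a
    pre-tokenized Parquet file can carry: input_ids (required), plus
    attention_mask and labels (optional). When train_on_inputs is False,
    prompt positions are masked in labels with -100 (an assumed convention)
    so they do not contribute to the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    attention_mask = [1] * len(input_ids)
    if train_on_inputs:
        labels = list(input_ids)
    else:
        labels = [-100] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Invented token ids, purely for illustration.
row = build_pretokenized_row([101, 2054, 102], [2003, 1037, 102])
```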
CSV format is supported only for evaluation purposes (not for fine-tuning training).
All formats enforce a minimum sample count (currently 1 sample), a maximum file size (50.1 GB), and UTF-8 encoding for text-based files.
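These constraints can be pre-checked locally before upload. The sketch below is a best-effort client-side check, not a reproduction of the SDK's own validator, and the limit constants simply restate the figures above:

```python
import json
import os

MIN_SAMPLES = 1           # minimum sample count stated above
MAX_FILE_SIZE_GB = 50.1   # maximum file size stated above

def precheck_jsonl(path):
    """Best-effort local checks before upload: file size, UTF-8
    decodability, per-line JSON validity, and a minimum number of
    samples. Returns a list of human-readable problems (empty if OK)."""
    problems = []
    size_gb = os.path.getsize(path) / (1024 ** 3)
    if size_gb > MAX_FILE_SIZE_GB:
        problems.append(f"file is {size_gb:.1f} GB, above the {MAX_FILE_SIZE_GB} GB limit")
    try:
        with open(path, encoding="utf-8") as f:
            samples = [json.loads(line) for line in f if line.strip()]
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON line: {exc}"]
    if len(samples) < MIN_SAMPLES:
        problems.append(f"only {len(samples)} sample(s); at least {MIN_SAMPLES} required")
    return problems
```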
Usage
Use this principle before uploading training data to Together AI. The dataset must be prepared in one of the supported formats before the file can pass validation and be accepted for fine-tuning. The choice of format depends on the training method:
- For SFT (Supervised Fine-Tuning): Use conversational, instruction, or general text formats.
- For DPO (Direct Preference Optimization): Use the DPO preference format.
- For pre-tokenized workflows: Use Parquet format with the tokenize_data.py example script as a reference.
Multimodal datasets (containing images) are supported within the conversational format by using the OpenAI-style content list structure with text and image_url content items. Images must be base64-encoded (JPEG, PNG, or WEBP), limited to 10 MB each, and at most 10 images per example.
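A multimodal conversational example can be assembled as sketched below. The data-URL shape for `image_url` follows the OpenAI-style content-list convention mentioned above; the helper names and example contents are invented, and the placeholder bytes stand in for a real encoded image (JPEG, PNG, or WEBP, at most 10 MB each):

```python
import base64

def image_content_item(image_bytes, mime="image/png"):
    """Build an OpenAI-style image_url content item from raw image bytes,
    base64-encoded into a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

def multimodal_example(question, answer, image_bytes):
    """One conversational sample whose user turn mixes text and an image."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": question},
                image_content_item(image_bytes),
            ]},
            {"role": "assistant", "content": answer},
        ]
    }

# Placeholder bytes; a real example would read an actual image file.
ex = multimodal_example("What is in this image?", "A cat.", b"\x89PNG-placeholder")
```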
Theoretical Basis
Fine-tuning requires structured training examples that the training pipeline can parse into input-output pairs for loss computation. The format determines how input masking (train_on_inputs) is applied:
- Conversational format: `{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}`
- Instruction format: `{"prompt": "...", "completion": "..."}`
- General text format: `{"text": "..."}`
- DPO preference format: `{"input": {"messages": [...]}, "preferred_output": [{"role": "assistant", "content": "..."}], "non_preferred_output": [{"role": "assistant", "content": "..."}]}`
The format-to-column mapping is defined in src/together/constants.py via the JSONL_REQUIRED_COLUMNS_MAP dictionary, which maps each DatasetFormat enum value to its required columns. Extra columns beyond the required set are rejected to prevent silent data issues.
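The exact-match rule can be sketched as follows. This is a hypothetical re-creation for illustration: the real mapping lives in src/together/constants.py as JSONL_REQUIRED_COLUMNS_MAP keyed by DatasetFormat enum values, and the string names used here are not the SDK's actual identifiers:

```python
# Illustrative stand-in for JSONL_REQUIRED_COLUMNS_MAP (names are invented).
REQUIRED_COLUMNS = {
    "conversational": {"messages"},
    "instruction": {"prompt", "completion"},
    "general_text": {"text"},
    "dpo_preference": {"input", "preferred_output", "non_preferred_output"},
}

def detect_format(sample):
    """Return the format whose required columns exactly match the sample's
    top-level keys. Extra or missing keys yield None, mirroring the rule
    that extra columns are rejected to prevent silent data issues."""
    keys = set(sample)
    for name, required in REQUIRED_COLUMNS.items():
        if keys == required:
            return name
    return None
```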
The conversational format additionally validates role sequences (must alternate user/assistant after an optional system message), message weights (each must be the integer 0 or 1), and content types (a string or a multimodal content list).
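The role-sequence check can be sketched as below. This is a simplified illustration of the constraint just described (optional leading system message, strict user/assistant alternation, at least one assistant message), not the SDK's actual validator:

```python
def validate_roles(messages):
    """Return True if the message list satisfies the conversational role
    constraints: an optional leading system message, then strictly
    alternating user/assistant turns starting with user, with at least
    one assistant message present."""
    roles = [m.get("role") for m in messages]
    if roles and roles[0] == "system":
        roles = roles[1:]
    if not roles:
        return False
    expected = "user"
    for role in roles:
        if role != expected:
            return False
        # Alternate the expected role for the next turn.
        expected = "assistant" if expected == "user" else "user"
    return "assistant" in roles
```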