
Implementation:Togethercomputer Together python Fine Tuning Dataset Format

From Leeroopedia
Implementation Name: Fine_Tuning_Dataset_Format
Type: Pattern Doc
Source: examples/tokenize_data.py:L1-239, src/together/constants.py:L1-74
Domain: MLOps, Fine_Tuning, Data_Preparation
Repository: togethercomputer/together-python
Last Updated: 2026-02-15 16:00 GMT

Overview

This is a Pattern Doc implementation. Users construct JSONL, Parquet, or CSV files following the schemas defined in the Together Python SDK. The SDK does not generate these files programmatically (except for the pre-tokenization example); instead it validates that user-prepared files conform to the required schemas before upload.

Code Reference

The format definitions are centralized in src/together/constants.py:

class DatasetFormat(enum.Enum):
    """Dataset format enum."""
    GENERAL = "general"
    CONVERSATION = "conversation"
    INSTRUCTION = "instruction"
    PREFERENCE_OPENAI = "preference_openai"

JSONL_REQUIRED_COLUMNS_MAP = {
    DatasetFormat.GENERAL: ["text"],
    DatasetFormat.CONVERSATION: ["messages"],
    DatasetFormat.INSTRUCTION: ["prompt", "completion"],
    DatasetFormat.PREFERENCE_OPENAI: [
        "input",
        "preferred_output",
        "non_preferred_output",
    ],
}
REQUIRED_COLUMNS_MESSAGE = ["role", "content"]
POSSIBLE_ROLES_CONVERSATION = ["system", "user", "assistant"]

# Parquet format
PARQUET_EXPECTED_COLUMNS = ["input_ids", "attention_mask", "labels"]

# Limits
MIN_SAMPLES = 1
MAX_FILE_SIZE_GB = 50.1
MAX_IMAGES_PER_EXAMPLE = 10
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB
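Because the required-columns map is exact (no extra keys are allowed in any JSONL format), a client-side format sniffer falls out of it directly. The sketch below re-declares the constants rather than importing the SDK, and `detect_format` is an illustrative helper, not part of the library:

```python
import enum


class DatasetFormat(enum.Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"
    INSTRUCTION = "instruction"
    PREFERENCE_OPENAI = "preference_openai"


JSONL_REQUIRED_COLUMNS_MAP = {
    DatasetFormat.GENERAL: ["text"],
    DatasetFormat.CONVERSATION: ["messages"],
    DatasetFormat.INSTRUCTION: ["prompt", "completion"],
    DatasetFormat.PREFERENCE_OPENAI: [
        "input",
        "preferred_output",
        "non_preferred_output",
    ],
}


def detect_format(sample: dict) -> DatasetFormat:
    """Return the format whose required columns exactly match the sample's keys."""
    keys = set(sample)
    for fmt, cols in JSONL_REQUIRED_COLUMNS_MAP.items():
        if keys == set(cols):
            return fmt
    raise ValueError(f"Unrecognized columns: {sorted(keys)}")
```

Matching on key *equality* (not subset) mirrors the "no extra columns" rule that every JSONL format below enforces.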

JSONL Formats

1. Conversational Format

Each line is a JSON object with a messages key containing a list of role/content dictionaries:

# Each line of the .jsonl file:
{"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]}

Rules:

  • Each message must contain role and content keys.
  • Valid roles: system, user, assistant.
  • Roles must alternate between user and assistant after an optional leading system message.
  • At least one assistant message must be present, since fine-tuning needs assistant turns to train on.
  • Messages may optionally include a weight field (integer, 0 or 1) to control loss masking.
  • No extra columns beyond messages are allowed.
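These rules can be checked locally before upload. The validator below is a sketch of the rules as stated above, not the SDK's own check:

```python
POSSIBLE_ROLES_CONVERSATION = ["system", "user", "assistant"]


def validate_messages(messages: list) -> None:
    """Check required keys, valid roles, optional leading system message,
    strict user/assistant alternation, and at least one assistant turn."""
    if not messages:
        raise ValueError("messages must be non-empty")
    for m in messages:
        if not {"role", "content"} <= set(m):
            raise ValueError("each message needs 'role' and 'content' keys")
        if m["role"] not in POSSIBLE_ROLES_CONVERSATION:
            raise ValueError(f"invalid role: {m['role']!r}")
    # Skip an optional leading system message, then enforce alternation.
    turns = messages[1:] if messages[0]["role"] == "system" else messages
    for i, m in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if m["role"] != expected:
            raise ValueError(f"expected {expected!r} at turn {i}, got {m['role']!r}")
    if not any(m["role"] == "assistant" for m in messages):
        raise ValueError("at least one assistant message is required")
```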

Multimodal variant (images in conversational format):

{"messages": [
    {"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZ..."
        }}
    ]},
    {"role": "assistant", "content": "This image shows a cat."}
]}
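Building the base64 data URL by hand is error-prone, and the SDK enforces a per-image size cap (MAX_IMAGE_BYTES above). A small helper can do both; `image_content_part` is an illustrative name, not an SDK function:

```python
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10MB cap from the SDK's constants


def image_content_part(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Encode raw image bytes as a data-URL content part, enforcing the size cap."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10MB per-image limit")
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
```

The returned dict can be placed alongside a `{"type": "text", ...}` part in a user message's content list, as in the example above.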

2. Instruction Format

Each line contains prompt and completion keys:

{"prompt": "Translate the following English sentence to French: 'Hello, how are you?'",
 "completion": "Bonjour, comment allez-vous ?"}

Rules:

  • Both prompt and completion must be present.
  • No extra columns are allowed.

3. General Text Format

Each line contains a single text key:

{"text": "The quick brown fox jumps over the lazy dog. This is a sample training text for continued pretraining."}

Rules:

  • The text field must be a string.
  • No extra columns are allowed.

4. DPO Preference Format

Each line contains input, preferred_output, and non_preferred_output keys for Direct Preference Optimization:

{
    "input": {
        "messages": [
            {"role": "user", "content": "Write a poem about the sea."}
        ]
    },
    "preferred_output": [
        {"role": "assistant", "content": "The ocean whispers to the shore..."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "The sea is big and blue."}
    ]
}

Rules:

  • input must be a dictionary containing a messages list.
  • The last message in input.messages must not be from the assistant role.
  • preferred_output and non_preferred_output must each be a list containing exactly one dictionary with role set to "assistant" and a content field.
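The DPO rules translate into a short local check. As before, this is a sketch of the stated rules, not the SDK's validator:

```python
def validate_preference_example(example: dict) -> None:
    """Check the DPO preference-format rules: input holds a messages list that
    does not end on an assistant turn, and each output is exactly one
    assistant message with a content field."""
    msgs = example["input"]["messages"]
    if not msgs:
        raise ValueError("input.messages must be non-empty")
    if msgs[-1]["role"] == "assistant":
        raise ValueError("input.messages must not end with an assistant turn")
    for key in ("preferred_output", "non_preferred_output"):
        out = example[key]
        if len(out) != 1 or out[0].get("role") != "assistant" or "content" not in out[0]:
            raise ValueError(f"{key} must be a single assistant message")
```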

Parquet Format (Pre-Tokenized)

Parquet files are used for pre-tokenized data. The required column is input_ids, with optional attention_mask and labels columns.

The SDK provides a reference script at examples/tokenize_data.py for preparing pre-tokenized datasets:

# Example: Tokenize a Hugging Face dataset into Parquet format
# python examples/tokenize_data.py \
#   --dataset clam004/antihallucination_dataset \
#   --tokenizer togethercomputer/Llama-3-8b-hf \
#   --max-seq-length 8192 \
#   --add-labels \
#   --packing \
#   --out-filename processed_dataset.parquet

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("clam004/antihallucination_dataset", split="train")
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3-8b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize with constant length (truncation + padding)
def tokenize_constant_length(data, tokenizer, max_length=2048):
    tokenized = tokenizer(
        data["text"],
        max_length=max_length,
        truncation=True,
        padding="max_length",
        add_special_tokens=True,
    )
    # Add labels masking padding tokens
    LOSS_IGNORE_INDEX = -100
    tokenized["labels"] = [
        LOSS_IGNORE_INDEX if token_id == tokenizer.pad_token_id else token_id
        for token_id in tokenized["input_ids"]
    ]
    return tokenized

# Pass the tokenizer through fn_kwargs; drop the raw columns so only
# input_ids / attention_mask / labels remain for the Parquet file.
tokenized_data = dataset.map(
    tokenize_constant_length,
    fn_kwargs={"tokenizer": tokenizer},
    remove_columns=dataset.column_names,
)
tokenized_data.to_parquet("processed_dataset.parquet")

The script also supports sequence packing via the --packing flag, which concatenates shorter sequences to fill the maximum sequence length, improving training throughput.
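The packing idea can be illustrated with a toy greedy packer. `pack_sequences` below is a deliberately simplified sketch (it separates examples with EOS and drops any final partial block), not the script's actual implementation:

```python
def pack_sequences(tokenized: list, max_len: int, eos_id: int) -> list:
    """Concatenate tokenized examples (EOS-separated) into fixed-length blocks
    of max_len tokens, so short examples waste no padding."""
    flat = []
    for ids in tokenized:
        flat.extend(ids)
        flat.append(eos_id)  # mark the example boundary
    # Slice into full blocks; a trailing partial block is discarded here.
    return [flat[i:i + max_len] for i in range(0, len(flat) - max_len + 1, max_len)]
```

With padding-based tokenization a 10-token example in an 8192-token context wastes over 99% of each row; packing fills that space with real tokens, which is why the flag improves throughput.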

CSV Format (Evaluation Only)

CSV files are accepted only when the file purpose is set to eval. They are not supported for fine-tuning training. The CSV must be well-formed with a consistent header row and no mismatched column counts.
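The well-formedness requirement amounts to a constant column count under the header row. A minimal local check with the standard `csv` module might look like this (`check_csv` is an illustrative helper, not an SDK call):

```python
import csv
import io


def check_csv(text: str) -> None:
    """Verify the CSV has a header row and that every subsequent row
    matches the header's column count."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("empty CSV: a header row is required")
    width = len(rows[0])
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            raise ValueError(f"row {lineno} has {len(row)} columns, expected {width}")
```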

Usage Examples

Preparing a Conversational Dataset

import json

# Prepare training data
conversations = [
    {"messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "You can use my_list[::-1] or my_list.reverse()."}
    ]},
    {"messages": [
        {"role": "user", "content": "What is a dictionary in Python?"},
        {"role": "assistant", "content": "A dictionary is a collection of key-value pairs."}
    ]},
]

# Write to JSONL file
with open("training_data.jsonl", "w") as f:
    for conv in conversations:
        f.write(json.dumps(conv) + "\n")

Preparing a DPO Preference Dataset

import json

preferences = [
    {
        "input": {"messages": [
            {"role": "user", "content": "Explain quantum computing."}
        ]},
        "preferred_output": [
            {"role": "assistant", "content": "Quantum computing uses qubits..."}
        ],
        "non_preferred_output": [
            {"role": "assistant", "content": "It is about computers."}
        ]
    },
]

with open("dpo_data.jsonl", "w") as f:
    for pref in preferences:
        f.write(json.dumps(pref) + "\n")
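Before uploading either file, it is cheap to re-parse it line by line so a malformed record is reported with its line number. `iter_jsonl` is an illustrative helper, not part of the SDK:

```python
import json


def iter_jsonl(text: str):
    """Yield one parsed object per non-blank JSONL line, pointing at the
    offending line number on a parse failure."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"line {lineno}: {e}") from e
```

In practice you would read the written file back (`open("dpo_data.jsonl").read()`) and pass the text through this iterator together with the format validators sketched earlier.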
