
Principle:Togethercomputer Together python Dataset Validation

From Leeroopedia
Principle Name: Dataset_Validation
Overview: Mechanism for validating training dataset files before upload to ensure compatibility with fine-tuning requirements.
Domain: MLOps, Fine_Tuning, Data_Preparation
Repository: togethercomputer/together-python
Last Updated: 2026-02-15 16:00 GMT

Description

Dataset validation runs a comprehensive check pipeline on local files before they are uploaded to Together AI for fine-tuning. The validation process is designed to catch formatting and structural errors early, preventing failed uploads and wasted time.

The validation pipeline performs the following checks in sequence:

  1. File existence -- Confirms the file exists at the specified path.
  2. File size -- Verifies the file is non-empty and does not exceed the maximum supported size (50.1 GB).
  3. File type detection -- Determines the format based on file extension (.jsonl, .parquet, or .csv). Unknown extensions cause validation failure.
  4. Format-specific validation:
    • JSONL files: UTF-8 encoding check, JSON parsing of each line, dataset format detection (conversational, instruction, general text, or DPO preference), required column verification, extra column rejection, and content-specific validation (role sequences, message structure, multimodal image constraints).
    • Parquet files: Schema validation ensuring the input_ids column exists, rejection of unexpected columns (only input_ids, attention_mask, and labels are allowed), and minimum sample count.
    • CSV files: Accepted only for evaluation purposes; rejected for fine-tuning. Validates UTF-8 encoding, header presence, and row consistency.
  5. Minimum sample count -- Ensures the dataset contains at least the minimum required number of samples (currently 1).
  6. Content integrity -- For multimodal datasets, validates that images are base64-encoded in supported formats (JPEG, PNG, WEBP), do not exceed 10 MB each, and that no example contains more than 10 images.
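The multimodal image constraints in step 6 can be sketched as follows. This is a minimal illustration, not the library's actual code: the function name, the magic-byte format detection, and the error strings are all assumptions; only the limits (JPEG/PNG/WEBP, 10 MB per image, 10 images per example) come from the description above.

```python
import base64

# Magic-byte prefixes for the supported formats (JPEG, PNG, WEBP).
_MAGIC = (
    b"\xff\xd8\xff",          # JPEG
    b"\x89PNG\r\n\x1a\n",     # PNG
    b"RIFF",                  # WEBP (a full check would also verify bytes 8-12 == b"WEBP")
)
MAX_IMAGE_BYTES = 10 * 1024 * 1024   # 10 MB per image
MAX_IMAGES_PER_EXAMPLE = 10

def check_images(images_b64: list[str]) -> list[str]:
    """Return a list of error strings; an empty list means the example passes."""
    errors = []
    if len(images_b64) > MAX_IMAGES_PER_EXAMPLE:
        errors.append(f"too many images: {len(images_b64)} > {MAX_IMAGES_PER_EXAMPLE}")
    for i, data in enumerate(images_b64):
        try:
            raw = base64.b64decode(data, validate=True)
        except Exception:
            errors.append(f"image {i}: not valid base64")
            continue
        if len(raw) > MAX_IMAGE_BYTES:
            errors.append(f"image {i}: exceeds 10 MB")
        if not any(raw.startswith(m) for m in _MAGIC):
            errors.append(f"image {i}: unsupported format (need JPEG/PNG/WEBP)")
    return errors
```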

The validation produces a structured report dictionary that summarizes the results of each check, enabling programmatic inspection of what passed and what failed.
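A progressive pipeline producing such a report might look like the sketch below. The report's field names are illustrative, not the library's actual schema; the ordering (existence and size, then UTF-8 decoding, then JSON parsing, then sample counting) follows the checks listed above.

```python
import json
import os

MIN_SAMPLES = 1  # minimum sample count stated above

def check_jsonl(path: str) -> dict:
    """Run the checks in sequence and return a structured report dictionary."""
    report = {"file_present": False, "utf8": False, "line_format": False,
              "min_samples": False, "num_samples": 0, "is_check_passed": False}
    # 1-2. File existence and non-empty size
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return report
    report["file_present"] = True
    # UTF-8 decoding happens before JSON parsing so error messages stay specific
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except UnicodeDecodeError:
        return report
    report["utf8"] = True
    try:
        samples = [json.loads(line) for line in lines if line.strip()]
    except json.JSONDecodeError:
        return report
    report["line_format"] = True
    # 5. Minimum sample count
    report["num_samples"] = len(samples)
    report["min_samples"] = len(samples) >= MIN_SAMPLES
    report["is_check_passed"] = report["min_samples"]
    return report
```

A real implementation would extend this with the format-specific schema checks, but the shape of the report (per-check booleans plus an overall pass flag) is the point of interest.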

Usage

Use this principle after preparing a dataset file and before uploading it to Together AI. The validation is:

  • Automatically invoked by Files.upload() when the check=True parameter is set (which is the default). If validation fails, a FileTypeError is raised and the upload is aborted.
  • Manually invocable via the check_file() utility function for standalone validation without triggering an upload.

The validation is purpose-aware: when the file purpose is FilePurpose.FineTune, full format validation is applied. When the purpose is FilePurpose.Eval, format-specific content validation (column schemas for JSONL) is relaxed, and CSV files are accepted.
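The purpose switch can be sketched as follows. `FilePurpose` and its members are named in the source; the `plan_checks` function, its return layout, and the treatment of Parquet under `Eval` are assumptions for illustration.

```python
from enum import Enum
from pathlib import Path

class FilePurpose(Enum):
    FineTune = "fine-tune"
    Eval = "eval"

def plan_checks(filename: str, purpose: FilePurpose) -> dict:
    """Decide which checks apply under the purpose-aware rules above."""
    ext = Path(filename).suffix.lower()
    if ext not in {".jsonl", ".parquet", ".csv"}:
        return {"accepted": False, "reason": "unknown extension"}
    if ext == ".csv" and purpose is FilePurpose.FineTune:
        return {"accepted": False, "reason": "CSV is only accepted for evaluation"}
    return {
        "accepted": True,
        # full JSONL column-schema validation applies only to fine-tuning
        "check_column_schema": purpose is FilePurpose.FineTune and ext == ".jsonl",
    }
```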

Theoretical Basis

Client-side validation follows the fail-fast design principle: catching errors before network transfer saves time and bandwidth. The validation pipeline mirrors the server-side expectations so that a file passing local validation will be accepted by the Together API.

The check pipeline uses a progressive validation strategy -- each check builds on the previous one. For example, UTF-8 validation occurs before JSON parsing, and JSON parsing occurs before schema validation. This ensures that error messages are specific and actionable rather than cascading from an earlier root cause.

For JSONL files, the validator detects the dataset format from the keys present in each JSON line, then applies format-specific rules. The format must be consistent across all lines in the file -- mixing formats (e.g., some lines with messages and others with prompt/completion) is rejected.
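The key-based detection and consistency rule can be sketched like this. The `messages` and `prompt`/`completion` keys are named above; the keys assumed for the preference and general-text formats (`chosen`/`rejected`, `text`) and the function names are illustrative.

```python
import json

def detect_format(sample: dict) -> str:
    """Map the top-level keys of one JSON line to a dataset format name."""
    keys = set(sample)
    if "messages" in keys:
        return "conversational"
    if {"prompt", "chosen", "rejected"} <= keys:
        return "preference"  # DPO-style
    if {"prompt", "completion"} <= keys:
        return "instruction"
    if "text" in keys:
        return "general_text"
    return "unknown"

def check_format_consistency(lines: list[str]) -> str:
    """Detect each line's format and require one consistent format per file;
    mixed formats are rejected, mirroring the rule described above."""
    formats = {detect_format(json.loads(line)) for line in lines if line.strip()}
    if len(formats) != 1 or "unknown" in formats:
        raise ValueError(f"inconsistent or unknown formats: {sorted(formats)}")
    return formats.pop()
```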
