
Implementation:Togethercomputer Together python Check File

From Leeroopedia
Attribute            Value
Implementation Name  Check_File
Type                 Utility Function
Source               src/together/utils/files.py:L52-115 (main entry), plus helper functions
Domain               MLOps, Fine_Tuning, Data_Preparation
Repository           togethercomputer/together-python
Last Updated         2026-02-15 16:00 GMT

API Signature

def check_file(
    file: Path | str,
    purpose: FilePurpose | str = FilePurpose.FineTune,
) -> Dict[str, Any]:

Import

from together.utils import check_file

I/O Contract

Inputs

Parameter  Type               Default               Description
file       Path | str         (required)            Path to the local dataset file to validate.
purpose    FilePurpose | str  FilePurpose.FineTune  Intended purpose of the file; determines which validations are applied. Supported values: "fine-tune", "eval".

Output

Returns a Dict[str, Any] with the following keys:

Key              Type               Description
is_check_passed  bool               Overall validation result: True if all checks passed.
message          str                Human-readable status message: "Checks passed" on success, or a description of the first failure encountered.
found            bool | None        Whether the file was found at the specified path.
file_size        int | None         File size in bytes, or None if the file was not found.
utf8             bool | None        Whether the file is valid UTF-8 (JSONL and CSV only).
line_type        bool | None        Whether each line is a valid JSON object (JSONL only).
text_field       bool | None        Whether required text fields are present.
key_value        bool | None        Whether required keys and value formats are correct.
has_min_samples  bool | None        Whether the file meets the minimum sample count requirement.
num_samples      int | None         Total number of samples in the file.
load_json        bool | None        Whether the JSONL file loaded successfully.
load_parquet     bool | str | None  Whether the Parquet file loaded successfully, or an error string on failure.
load_csv         bool | None        Whether the CSV file loaded successfully.
filetype         str                Detected file type: "jsonl", "parquet", or "csv", or an error message for unknown extensions.

Keys other than is_check_passed, message, and filetype start as None and are only populated when the corresponding check actually runs for the detected file type.
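Because per-check keys remain None until their check runs, a report can be scanned for the checks that explicitly failed. A minimal sketch of that pattern; failed_checks and the sample report below are illustrative, not part of the SDK:

```python
def failed_checks(report: dict) -> list:
    """Return the names of per-check keys that are explicitly False.

    Keys that are None (check never ran) or True (check passed) are
    skipped, as are the non-boolean summary keys.
    """
    skip = {"is_check_passed", "message", "filetype", "file_size", "num_samples"}
    return [key for key, value in report.items()
            if key not in skip and value is False]

# Hypothetical report for a JSONL file that parsed line-by-line checks
# but failed format detection:
report = {
    "is_check_passed": False,
    "message": "Error parsing file ...",
    "found": True,
    "utf8": True,
    "line_type": False,
    "load_json": False,
    "filetype": "jsonl",
}
print(failed_checks(report))  # → ['line_type', 'load_json']
```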

Code Reference

The main entry point at src/together/utils/files.py:L52-115:

def check_file(
    file: Path | str,
    purpose: FilePurpose | str = FilePurpose.FineTune,
) -> Dict[str, Any]:
    if not isinstance(file, Path):
        file = Path(file)

    report_dict = {
        "is_check_passed": True,
        "message": "Checks passed",
        "found": None,
        "file_size": None,
        "utf8": None,
        "line_type": None,
        "text_field": None,
        "key_value": None,
        "has_min_samples": None,
        "num_samples": None,
        "load_json": None,
        "load_csv": None,
    }

    if not file.is_file():
        report_dict["found"] = False
        report_dict["is_check_passed"] = False
        return report_dict
    else:
        report_dict["found"] = True

    file_size = os.stat(file).st_size

    if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
        report_dict["message"] = (
            f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB. ..."
        )
        report_dict["is_check_passed"] = False
    elif file_size == 0:
        report_dict["message"] = "File is empty"
        report_dict["file_size"] = 0
        report_dict["is_check_passed"] = False
        return report_dict
    else:
        report_dict["file_size"] = file_size

    # Dispatch to format-specific validators
    if file.suffix == ".jsonl":
        report_dict["filetype"] = "jsonl"
        data_report_dict = _check_jsonl(file, purpose)
    elif file.suffix == ".parquet":
        report_dict["filetype"] = "parquet"
        data_report_dict = _check_parquet(file, purpose)
    elif file.suffix == ".csv":
        report_dict["filetype"] = "csv"
        data_report_dict = _check_csv(file, purpose)
    else:
        report_dict["filetype"] = f"Unknown extension of file {file}. ..."
        report_dict["is_check_passed"] = False
        # No format-specific validator ran, so data_report_dict is
        # undefined; return early to avoid a NameError.
        return report_dict

    report_dict.update(data_report_dict)
    return report_dict

Internal Helper Functions

The format-specific validators are private functions in the same module:

  • _check_jsonl(file, purpose) -- Validates UTF-8 encoding, parses each JSON line, detects the dataset format from JSONL_REQUIRED_COLUMNS_MAP, rejects extra columns, and delegates to content validators (validate_messages(), validate_preference_openai()).
  • _check_parquet(file, purpose) -- Loads the Parquet file via PyArrow, checks for the required input_ids column, rejects unexpected columns, and verifies minimum sample count.
  • _check_csv(file, purpose) -- Validates that the purpose is eval (CSV is not supported for fine-tuning), checks UTF-8, and validates row consistency against the header.
  • _check_utf8(file) -- Iterates through the file with UTF-8 encoding to detect encoding errors.
  • _check_samples_count(file, report_dict, idx) -- Verifies the sample count meets MIN_SAMPLES.
  • validate_messages(messages, idx, require_assistant_role) -- Validates conversational message structure: types, roles, content, weights, and multimodal constraints.
  • validate_preference_openai(example, idx) -- Validates DPO preference format structure.
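These helpers are private and subject to change, but the UTF-8 pass can be approximated with the standard library alone. A sketch of the pattern; check_utf8 here is an illustrative stand-in, not the SDK's _check_utf8 (which also records its result in the report dict):

```python
from pathlib import Path

def check_utf8(file: Path) -> bool:
    """Return True if the file decodes cleanly as UTF-8.

    Streams the file line by line so large datasets are never
    loaded into memory at once.
    """
    try:
        with file.open(encoding="utf-8") as f:
            for _ in f:
                pass
        return True
    except UnicodeDecodeError:
        return False
```

The streaming loop mirrors the validator's behavior of surfacing the first encoding error rather than decoding the whole file up front.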

Usage Examples

Basic File Validation

from together.utils import check_file

# Validate a JSONL file for fine-tuning
report = check_file("training_data.jsonl")

if report["is_check_passed"]:
    print(f"Validation passed! {report['num_samples']} samples found.")
else:
    print(f"Validation failed: {report['message']}")

Validation with Custom Purpose

from together.utils import check_file
from together.types import FilePurpose

# Validate a CSV file for evaluation
report = check_file("eval_data.csv", purpose=FilePurpose.Eval)
print(report)

Automatic Validation During Upload

from together import Together

client = Together()

# check=True (default) automatically calls check_file() before upload
# Raises FileTypeError if validation fails
response = client.files.upload("training_data.jsonl", check=True)
print(f"Uploaded: {response.id}")

Inspecting a Failed Validation Report

from together.utils import check_file

report = check_file("bad_data.jsonl")
# Example output for a file with missing required keys:
# {
#     "is_check_passed": False,
#     "message": "Error parsing file. Could not detect a format for the line 3...",
#     "found": True,
#     "file_size": 1024,
#     "utf8": True,
#     "line_type": True,
#     "text_field": True,
#     "key_value": True,
#     "has_min_samples": None,
#     "num_samples": None,
#     "load_json": False,
#     "filetype": "jsonl"
# }
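When running validation manually (for example, before an upload with check=False), a failed report can be turned into an exception up front. A sketch under the report shape above; assert_check_passed is a hypothetical helper, not SDK API, and the SDK's own check=True path raises its own FileTypeError instead:

```python
def assert_check_passed(report: dict) -> None:
    """Raise ValueError with the report's message if validation failed."""
    if not report.get("is_check_passed"):
        raise ValueError(f"File validation failed: {report.get('message')}")

# Hypothetical failed report:
report = {"is_check_passed": False, "message": "File is empty"}
try:
    assert_check_passed(report)
except ValueError as err:
    print(err)  # → File validation failed: File is empty
```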
