
Implementation:Togethercomputer Together python Check File

From Leeroopedia
Attribute            Value
Implementation Name  Check_File
Type                 Utility Function
Source               src/together/utils/files.py:L52-115 (main entry), plus helper functions
Domain               MLOps, Fine_Tuning, Data_Preparation
Repository           togethercomputer/together-python
Last Updated         2026-02-15 16:00 GMT

API Signature

def check_file(
    file: Path | str,
    purpose: FilePurpose | str = FilePurpose.FineTune,
) -> Dict[str, Any]:

Import

from together.utils import check_file

I/O Contract

Inputs

Parameter  Type               Default               Description
file       Path | str         (required)            Path to the local dataset file to validate.
purpose    FilePurpose | str  FilePurpose.FineTune  Intended purpose of the file; determines which validations are applied. Supported values: "fine-tune", "eval".

Output

Returns a Dict[str, Any] with the following keys:

Key              Type               Description
is_check_passed  bool               Overall validation result: True if all checks passed.
message          str                Human-readable status message: "Checks passed" on success, or a description of the first failure encountered.
found            bool | None        Whether the file was found at the specified path.
file_size        int | None         File size in bytes, or None if the file was not found.
utf8             bool | None        Whether the file is valid UTF-8 (JSONL and CSV only).
line_type        bool | None        Whether each line is a valid JSON object (JSONL only).
text_field       bool | None        Whether required text fields are present.
key_value        bool | None        Whether required keys and value formats are correct.
has_min_samples  bool | None        Whether the file meets the minimum sample count requirement.
num_samples      int | None         Total number of samples in the file.
load_json        bool | None        Whether the JSONL file loaded successfully.
load_parquet     bool | str | None  Whether the Parquet file loaded successfully, or an error string on failure.
load_csv         bool | None        Whether the CSV file loaded successfully.
filetype         str                Detected file type: "jsonl", "parquet", or "csv", or an error message for unknown extensions.

Keys other than is_check_passed, message, and filetype start as None and are only populated when the corresponding check actually runs for the detected file type.
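Because per-check keys remain None until their check runs, a report can be scanned for the checks that explicitly failed. A minimal sketch of that pattern; failed_checks and the sample report below are illustrative, not part of the SDK:

```python
def failed_checks(report: dict) -> list:
    """Return the names of per-check keys that are explicitly False.

    Keys that are None (check never ran) or True (check passed) are
    skipped, as are the non-boolean summary keys.
    """
    skip = {"is_check_passed", "message", "filetype", "file_size", "num_samples"}
    return [key for key, value in report.items()
            if key not in skip and value is False]

# Hypothetical report for a JSONL file that parsed line-by-line checks
# but failed format detection:
report = {
    "is_check_passed": False,
    "message": "Error parsing file ...",
    "found": True,
    "utf8": True,
    "line_type": False,
    "load_json": False,
    "filetype": "jsonl",
}
print(failed_checks(report))  # → ['line_type', 'load_json']
```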

Code Reference

The main entry point at src/together/utils/files.py:L52-115:

def check_file(
    file: Path | str,
    purpose: FilePurpose | str = FilePurpose.FineTune,
) -> Dict[str, Any]:
    if not isinstance(file, Path):
        file = Path(file)

    report_dict = {
        "is_check_passed": True,
        "message": "Checks passed",
        "found": None,
        "file_size": None,
        "utf8": None,
        "line_type": None,
        "text_field": None,
        "key_value": None,
        "has_min_samples": None,
        "num_samples": None,
        "load_json": None,
        "load_csv": None,
    }

    if not file.is_file():
        report_dict["found"] = False
        report_dict["is_check_passed"] = False
        return report_dict
    else:
        report_dict["found"] = True

    file_size = os.stat(file).st_size

    if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
        report_dict["message"] = (
            f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB. ..."
        )
        report_dict["is_check_passed"] = False
    elif file_size == 0:
        report_dict["message"] = "File is empty"
        report_dict["file_size"] = 0
        report_dict["is_check_passed"] = False
        return report_dict
    else:
        report_dict["file_size"] = file_size

    # Dispatch to format-specific validators
    if file.suffix == ".jsonl":
        report_dict["filetype"] = "jsonl"
        data_report_dict = _check_jsonl(file, purpose)
    elif file.suffix == ".parquet":
        report_dict["filetype"] = "parquet"
        data_report_dict = _check_parquet(file, purpose)
    elif file.suffix == ".csv":
        report_dict["filetype"] = "csv"
        data_report_dict = _check_csv(file, purpose)
    else:
        report_dict["filetype"] = f"Unknown extension of file {file}. ..."
        report_dict["is_check_passed"] = False
        # No format-specific validator ran, so data_report_dict is
        # undefined; return early to avoid a NameError.
        return report_dict

    report_dict.update(data_report_dict)
    return report_dict

Internal Helper Functions

The format-specific validators are private functions in the same module:

  • _check_jsonl(file, purpose) -- Validates UTF-8 encoding, parses each JSON line, detects the dataset format from JSONL_REQUIRED_COLUMNS_MAP, rejects extra columns, and delegates to content validators (validate_messages(), validate_preference_openai()).
  • _check_parquet(file, purpose) -- Loads the Parquet file via PyArrow, checks for the required input_ids column, rejects unexpected columns, and verifies minimum sample count.
  • _check_csv(file, purpose) -- Validates that the purpose is eval (CSV is not supported for fine-tuning), checks UTF-8, and validates row consistency against the header.
  • _check_utf8(file) -- Iterates through the file with UTF-8 encoding to detect encoding errors.
  • _check_samples_count(file, report_dict, idx) -- Verifies the sample count meets MIN_SAMPLES.
  • validate_messages(messages, idx, require_assistant_role) -- Validates conversational message structure: types, roles, content, weights, and multimodal constraints.
  • validate_preference_openai(example, idx) -- Validates DPO preference format structure.
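These helpers are private and subject to change, but the UTF-8 pass can be approximated with the standard library alone. A sketch of the pattern; check_utf8 here is an illustrative stand-in, not the SDK's _check_utf8 (which also records its result in the report dict):

```python
from pathlib import Path

def check_utf8(file: Path) -> bool:
    """Return True if the file decodes cleanly as UTF-8.

    Streams the file line by line so large datasets are never
    loaded into memory at once.
    """
    try:
        with file.open(encoding="utf-8") as f:
            for _ in f:
                pass
        return True
    except UnicodeDecodeError:
        return False
```

The streaming loop mirrors the validator's behavior of surfacing the first encoding error rather than decoding the whole file up front.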

Usage Examples

Basic File Validation

from together.utils import check_file

# Validate a JSONL file for fine-tuning
report = check_file("training_data.jsonl")

if report["is_check_passed"]:
    print(f"Validation passed! {report['num_samples']} samples found.")
else:
    print(f"Validation failed: {report['message']}")

Validation with Custom Purpose

from together.utils import check_file
from together.types import FilePurpose

# Validate a CSV file for evaluation
report = check_file("eval_data.csv", purpose=FilePurpose.Eval)
print(report)

Automatic Validation During Upload

from together import Together

client = Together()

# check=True (default) automatically calls check_file() before upload
# Raises FileTypeError if validation fails
response = client.files.upload("training_data.jsonl", check=True)
print(f"Uploaded: {response.id}")

Inspecting a Failed Validation Report

from together.utils import check_file

report = check_file("bad_data.jsonl")
# Example output for a file with missing required keys:
# {
#     "is_check_passed": False,
#     "message": "Error parsing file. Could not detect a format for the line 3...",
#     "found": True,
#     "file_size": 1024,
#     "utf8": True,
#     "line_type": True,
#     "text_field": True,
#     "key_value": True,
#     "has_min_samples": None,
#     "num_samples": None,
#     "load_json": False,
#     "filetype": "jsonl"
# }
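When running validation manually (for example, before an upload with check=False), a failed report can be turned into an exception up front. A sketch under the report shape above; assert_check_passed is a hypothetical helper, not SDK API, and the SDK's own check=True path raises its own FileTypeError instead:

```python
def assert_check_passed(report: dict) -> None:
    """Raise ValueError with the report's message if validation failed."""
    if not report.get("is_check_passed"):
        raise ValueError(f"File validation failed: {report.get('message')}")

# Hypothetical failed report:
report = {"is_check_passed": False, "message": "File is empty"}
try:
    assert_check_passed(report)
except ValueError as err:
    print(err)  # → File validation failed: File is empty
```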
