Implementation:Togethercomputer Together python Check File
| Attribute | Value |
|---|---|
| Implementation Name | Check_File |
| Type | Utility Function |
| Source | src/together/utils/files.py:L52-115 (main entry), plus helper functions |
| Domain | MLOps, Fine_Tuning, Data_Preparation |
| Repository | togethercomputer/together-python |
| Last Updated | 2026-02-15 16:00 GMT |
API Signature
```python
def check_file(
    file: Path | str,
    purpose: FilePurpose | str = FilePurpose.FineTune,
) -> Dict[str, Any]:
```
Import
```python
from together.utils import check_file
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | Path or str | (required) | Path to the local dataset file to validate. |
| purpose | FilePurpose or str | FilePurpose.FineTune | The intended purpose of the file. Affects which validations are applied. Supported values: "fine-tune", "eval". |
Output
Returns a Dict[str, Any] with the following keys:
| Key | Type | Description |
|---|---|---|
| is_check_passed | bool | Overall validation result. True if all checks passed. |
| message | str | Human-readable status message. "Checks passed" on success, or a description of the first failure encountered. |
| found | bool or None | Whether the file was found at the specified path. |
| file_size | int or None | File size in bytes, or None if the file was not found. |
| utf8 | bool or None | Whether the file is valid UTF-8 (JSONL and CSV only). |
| line_type | bool or None | Whether each line is a valid JSON object (JSONL only). |
| text_field | bool or None | Whether required text fields are present. |
| key_value | bool or None | Whether required keys and value formats are correct. |
| has_min_samples | bool or None | Whether the file meets the minimum sample count requirement. |
| num_samples | int or None | Total number of samples in the file. |
| load_json | bool or None | Whether the JSONL file loaded successfully. |
| load_parquet | str or None | Whether the Parquet file loaded successfully, or an error string. |
| load_csv | bool or None | Whether the CSV file loaded successfully. |
| filetype | str | Detected file type: "jsonl", "parquet", "csv", or an error message for unknown extensions. |
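Because each check is reported under its own key, a caller can list which checks explicitly failed instead of relying on `message` alone. The following is a minimal sketch, not part of the library: the helper name `failed_checks` and the sample report are hypothetical, and the code assumes only the key layout documented above, treating `None` as "check not run".

```python
from typing import Any, Dict, List

# Keys that carry metadata rather than a pass/fail result (per the table above).
NON_CHECK_KEYS = {"is_check_passed", "message", "file_size", "num_samples", "filetype"}


def failed_checks(report: Dict[str, Any]) -> List[str]:
    """Return the names of checks whose value is exactly False.

    Hypothetical helper: None means the check did not run, and a string
    value (such as an error string under load_parquet) is not counted.
    """
    return sorted(
        key
        for key, value in report.items()
        if key not in NON_CHECK_KEYS and value is False
    )


# Hand-written sample report for illustration (not real library output).
sample_report = {
    "is_check_passed": False,
    "message": "Error parsing file.",
    "found": True,
    "file_size": 1024,
    "utf8": True,
    "line_type": True,
    "text_field": True,
    "key_value": True,
    "has_min_samples": None,
    "num_samples": None,
    "load_json": False,
    "filetype": "jsonl",
}

print(failed_checks(sample_report))
```

Comparing with `is False` rather than truthiness keeps checks that never ran (`None`) out of the result.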
Code Reference
The main entry point at src/together/utils/files.py:L52-115:
```python
def check_file(
    file: Path | str,
    purpose: FilePurpose | str = FilePurpose.FineTune,
) -> Dict[str, Any]:
    if not isinstance(file, Path):
        file = Path(file)

    report_dict = {
        "is_check_passed": True,
        "message": "Checks passed",
        "found": None,
        "file_size": None,
        "utf8": None,
        "line_type": None,
        "text_field": None,
        "key_value": None,
        "has_min_samples": None,
        "num_samples": None,
        "load_json": None,
        "load_csv": None,
    }

    if not file.is_file():
        report_dict["found"] = False
        report_dict["is_check_passed"] = False
        return report_dict
    else:
        report_dict["found"] = True

    file_size = os.stat(file).st_size

    if file_size > MAX_FILE_SIZE_GB * NUM_BYTES_IN_GB:
        report_dict["message"] = (
            f"Maximum supported file size is {MAX_FILE_SIZE_GB} GB. ..."
        )
        report_dict["is_check_passed"] = False
    elif file_size == 0:
        report_dict["message"] = "File is empty"
        report_dict["file_size"] = 0
        report_dict["is_check_passed"] = False
        return report_dict
    else:
        report_dict["file_size"] = file_size

    # Start with an empty data report so the unknown-extension branch
    # still has something to merge below.
    data_report_dict = {}

    # Dispatch to format-specific validators
    if file.suffix == ".jsonl":
        report_dict["filetype"] = "jsonl"
        data_report_dict = _check_jsonl(file, purpose)
    elif file.suffix == ".parquet":
        report_dict["filetype"] = "parquet"
        data_report_dict = _check_parquet(file, purpose)
    elif file.suffix == ".csv":
        report_dict["filetype"] = "csv"
        data_report_dict = _check_csv(file, purpose)
    else:
        report_dict["filetype"] = f"Unknown extension of file {file}. ..."
        report_dict["is_check_passed"] = False

    report_dict.update(data_report_dict)
    return report_dict
```
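The size gate in the listing can be isolated as a small pure function, which makes its three outcomes easy to see. This is an illustrative sketch only: `classify_size` is a hypothetical name, its message text is shortened, and the 1 GB limit and 2**30 bytes-per-GB used in the example are placeholder assumptions rather than the library's `MAX_FILE_SIZE_GB` and `NUM_BYTES_IN_GB` values.

```python
from typing import Optional


def classify_size(file_size: int, max_gb: float, bytes_per_gb: int) -> Optional[str]:
    """Return an error message for an unacceptable size, or None if the size is fine.

    Hypothetical helper mirroring the order of the checks in the listing:
    the oversized case is tested first, then the empty case.
    """
    if file_size > max_gb * bytes_per_gb:
        return f"Maximum supported file size is {max_gb} GB."
    if file_size == 0:
        return "File is empty"
    return None


# Placeholder limits chosen for illustration only.
ASSUMED_MAX_GB = 1
ASSUMED_BYTES_PER_GB = 2**30

for size in (0, 4096, 2**30 + 1):
    print(size, classify_size(size, ASSUMED_MAX_GB, ASSUMED_BYTES_PER_GB))
```

Note that in the listing only the empty-file branch returns early; the oversized branch marks the check as failed, leaves `file_size` as `None`, and still proceeds to the format-specific validator.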
Internal Helper Functions
The format-specific validators are private functions in the same module:
- `_check_jsonl(file, purpose)`: Validates UTF-8 encoding, parses each JSON line, detects the dataset format from `JSONL_REQUIRED_COLUMNS_MAP`, rejects extra columns, and delegates to content validators (`validate_messages()`, `validate_preference_openai()`).
- `_check_parquet(file, purpose)`: Loads the Parquet file via PyArrow, checks for the required `input_ids` column, rejects unexpected columns, and verifies minimum sample count.
- `_check_csv(file, purpose)`: Validates that the purpose is `eval` (CSV is not supported for fine-tuning), checks UTF-8, and validates row consistency against the header.
- `_check_utf8(file)`: Iterates through the file with UTF-8 encoding to detect encoding errors.
- `_check_samples_count(file, report_dict, idx)`: Verifies the sample count meets `MIN_SAMPLES`.
- `validate_messages(messages, idx, require_assistant_role)`: Validates conversational message structure: types, roles, content, weights, and multimodal constraints.
- `validate_preference_openai(example, idx)`: Validates DPO preference format structure.
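Two of the checks described above, the UTF-8 scan and the "each line is a JSON object" test, are simple enough to sketch with the standard library. The code below is not the library's implementation: `is_utf8` and `first_bad_jsonl_line` are hypothetical names, and the sketch only illustrates the general technique of iterating a file opened with `encoding="utf-8"` and parsing each line with `json.loads`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def is_utf8(path: Path) -> bool:
    """Return True if the whole file decodes as UTF-8."""
    try:
        with path.open(encoding="utf-8") as f:
            for _ in f:  # Decoding happens as lines are read.
                pass
    except UnicodeDecodeError:
        return False
    return True


def first_bad_jsonl_line(path: Path) -> Optional[int]:
    """Return the 1-based number of the first line that is not a JSON object.

    Returns None when every line parses to a dict. Assumes the file has
    already passed the UTF-8 check.
    """
    with path.open(encoding="utf-8") as f:
        for idx, line in enumerate(f, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                return idx
            if not isinstance(obj, dict):
                return idx
    return None


# Small demonstration using throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.jsonl"
    good.write_text('{"text": "a"}\n{"text": "b"}\n', encoding="utf-8")
    mixed = Path(tmp) / "mixed.jsonl"
    mixed.write_text('{"text": "a"}\n[1, 2, 3]\n', encoding="utf-8")
    print(is_utf8(good), first_bad_jsonl_line(good))
    print(is_utf8(mixed), first_bad_jsonl_line(mixed))
```

Reporting the first offending line number, rather than a bare boolean, matches the way the report's `message` describes the first failure encountered.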
Usage Examples
Basic File Validation
```python
from together.utils import check_file

# Validate a JSONL file for fine-tuning
report = check_file("training_data.jsonl")

if report["is_check_passed"]:
    print(f"Validation passed! {report['num_samples']} samples found.")
else:
    print(f"Validation failed: {report['message']}")
```
Validation with Custom Purpose
```python
from together.utils import check_file
from together.types import FilePurpose

# Validate a CSV file for evaluation
report = check_file("eval_data.csv", purpose=FilePurpose.Eval)
print(report)
```
Automatic Validation During Upload
```python
from together import Together

client = Together()

# check=True (default) automatically calls check_file() before upload
# Raises FileTypeError if validation fails
response = client.files.upload("training_data.jsonl", check=True)
print(f"Uploaded: {response.id}")
```
Inspecting a Failed Validation Report
```python
from together.utils import check_file

report = check_file("bad_data.jsonl")

# Example output for a file with missing required keys:
# {
#     "is_check_passed": False,
#     "message": "Error parsing file. Could not detect a format for the line 3...",
#     "found": True,
#     "file_size": 1024,
#     "utf8": True,
#     "line_type": True,
#     "text_field": True,
#     "key_value": True,
#     "has_min_samples": None,
#     "num_samples": None,
#     "load_json": False,
#     "filetype": "jsonl"
# }
```