Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Run llama Llama index Training Data Validation

From Leeroopedia

Overview

Training Data Validation is a critical quality gate in the LLM finetuning pipeline that sits between data collection and job submission. Before uploading training data to the OpenAI finetuning API, the dataset must be validated to ensure structural correctness, format compliance, and reasonable cost expectations. Submitting malformed training data results in wasted API calls, failed jobs, and potentially corrupted finetuned models.

LlamaIndex provides a validation utility adapted directly from OpenAI's official dataset preparation guide that performs comprehensive checks on JSONL training files. This validation step catches errors early -- before any money is spent on API calls -- and provides actionable feedback about data quality issues.

Format Validation

The validation process checks for multiple categories of structural errors in the JSONL training data:

Error Type Description Check
data_type Each line must parse as a JSON dictionary isinstance(ex, dict)
missing_messages_list Each example must have a "messages" key ex.get("messages")
message_missing_key Each message must have "role" and "content" Key presence check
message_unrecognized_key Messages should only have "role", "content", "name" Unexpected key detection
unrecognized_role Roles must be "system", "user", or "assistant" Role value validation
missing_content Content must be a non-empty string Type and presence check
example_missing_assistant_message At least one message must be from the assistant Role scan across messages

These checks ensure every training example conforms to the OpenAI Chat Completions message structure, which is the required format for finetuning chat models.

Token Counting

Beyond structural validation, the utility performs token-level analysis using the tiktoken library with the cl100k_base encoding (used by GPT-3.5-turbo and GPT-4):

  • Per-message token counting: Accounts for per-message overhead tokens (3 per message) and name tokens (1 per name field)
  • Conversation length analysis: Computes total tokens per example to identify conversations exceeding the 4096-token training limit
  • Assistant token analysis: Separately counts assistant response tokens to understand response length distribution
  • Statistical distributions: Reports min, max, mean, median, p5, and p95 for all token metrics

Cost Estimation

The validation utility estimates training costs using OpenAI's pricing model:

  • Epoch calculation: Automatically determines the number of training epochs based on dataset size, targeting 3 epochs by default with adjustments for small (minimum 100 total examples) or large (maximum 25,000 total examples) datasets
  • Billing token calculation: Each example is capped at 4096 tokens for billing purposes; longer examples are truncated
  • Total cost projection: Multiplies billing tokens by epochs and the per-token price to estimate total training cost

This upfront cost visibility helps teams make informed decisions about dataset size and budgets before committing to a finetuning job.

Validation Workflow

from llama_index.finetuning.openai.validate_json import validate_json

# Validate the training data file
validate_json("training_data.jsonl")
# Outputs:
#   Num examples: 50
#   Format errors (if any)
#   Token distribution statistics
#   Cost estimate

The validation function is also called automatically by OpenAIFinetuneEngine.finetune() before uploading data, unless explicitly disabled with validate_json=False. This ensures that even if a developer skips manual validation, the finetuning engine performs it as a safety check.

Key Considerations

  • Token limit awareness: Examples exceeding 4096 tokens are not rejected but will be truncated during training, potentially losing important context
  • Missing role warnings: The validator reports examples missing system or user messages as warnings, not errors -- they are technically valid but may produce suboptimal training results
  • Epoch auto-tuning: For very small datasets (under ~33 examples), the system automatically increases epochs to ensure the model sees enough training iterations; for very large datasets (over ~8,333 examples), it reduces epochs to avoid overfitting
  • Cost accuracy: The cost estimates use a fixed price ($0.008/1K tokens as of August 2023) which may not reflect current pricing; always check the OpenAI pricing page

Knowledge Sources

Metadata

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment