Implementation:Run llama Llama index Validate Json

Overview

The validate_json function performs comprehensive validation of JSONL training data files before they are submitted to the OpenAI finetuning API. It checks structural format compliance, counts tokens, analyzes distributions, and estimates training costs. This utility is adapted from OpenAI's official dataset preparation guide.

Source File

File: llama-index-finetuning/llama_index/finetuning/openai/validate_json.py
Lines: 19-182
Import: from llama_index.finetuning.openai.validate_json import validate_json

Dependencies

json -- JSONL parsing
numpy -- Statistical distribution calculations
tiktoken -- Token counting with cl100k_base encoding

Function Signature

def validate_json(data_path: str) -> None:

Parameters:

Parameter	Type	Description
`data_path`	`str`	Path to the JSONL file to validate

Returns: None -- All output is printed to stdout.

Validation Stages

Stage 1: Load and Inspect

The function reads all lines from the JSONL file, parsing each as a JSON object:

with open(data_path) as f:
    dataset = [json.loads(line) for line in f]

Prints the total number of examples and the first example for quick inspection.

Stage 2: Format Error Checks

Iterates over every example and checks for the following error categories:

Error Key	Condition	Severity
`data_type`	Example is not a `dict`	Fatal
`missing_messages_list`	No `"messages"` key or empty value	Fatal
`message_missing_key`	A message lacks `"role"` or `"content"`	Error
`message_unrecognized_key`	A message has keys other than `"role"`, `"content"`, `"name"`	Warning
`unrecognized_role`	Role is not `"system"`, `"user"`, or `"assistant"`	Error
`missing_content`	Content is `None`, empty, or not a string	Error
`example_missing_assistant_message`	No message with `"assistant"` role in the example	Error

Errors are accumulated using a defaultdict(int) counter and printed as a summary.

Stage 3: Token Counting

Uses tiktoken.get_encoding("cl100k_base") to count tokens with two internal helper functions:

num_tokens_from_messages: Counts total tokens in a conversation, adding 3 tokens per message overhead and 1 token per name field. Also handles function_call values by converting them to strings before encoding.

num_assistant_tokens_from_messages: Counts only the tokens in assistant role messages, to analyze response length distribution.

Stage 4: Distribution Analysis

Computes and prints distributions for three metrics:

Metric	Description
`num_messages_per_example`	How many messages are in each conversation
`num_total_tokens_per_example`	Total token count per conversation
`num_assistant_tokens_per_example`	Token count of assistant responses only

Each distribution reports: min, max, mean, median, p5 (10th percentile), and p95 (90th percentile).

Also reports how many examples exceed the 4096-token training limit.

Stage 5: Cost Estimation

Estimates training costs using the following logic:

# Constants
MAX_TOKENS_PER_EXAMPLE = 4096
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_EPOCHS = 1
MAX_EPOCHS = 25

# Epoch auto-tuning
n_epochs = TARGET_EPOCHS
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

# Billing calculation (tokens capped at 4096 per example)
n_billing_tokens_in_dataset = sum(
    min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens
)

Reports: total billing tokens, default epoch count, total charged tokens, and per-epoch cost at $0.008/1K tokens.

Usage Example

from llama_index.finetuning.openai.validate_json import validate_json

# Validate training data before launching finetuning
validate_json("training_data.jsonl")

Sample Output:

Num examples: 50
First example:
{'role': 'system', 'content': 'You are a helpful assistant.'}
{'role': 'user', 'content': 'What is RAG?'}
{'role': 'assistant', 'content': 'RAG stands for...'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 5
mean / median: 3.2, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 45, 892
mean / median: 234.5, 198.0

Dataset has ~11725 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~35175 tokens

CLI Usage

The file can also be run directly from the command line:

# python validate_json.py <path_to_jsonl_file>
if __name__ == "__main__":
    data_path = sys.argv[1]
    if not os.path.exists(data_path):
        raise ValueError(f"Path {data_path} does not exist")
    validate_json(data_path)

Knowledge Sources

Metadata

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment