Implementation:Run llama Llama index Validate Json
Overview
The validate_json function performs comprehensive validation of JSONL training data files before they are submitted to the OpenAI finetuning API. It checks structural format compliance, counts tokens, analyzes distributions, and estimates training costs. This utility is adapted from OpenAI's official dataset preparation guide.
Source File
- File:
llama-index-finetuning/llama_index/finetuning/openai/validate_json.py - Lines: 19-182
- Import:
from llama_index.finetuning.openai.validate_json import validate_json
Dependencies
json-- JSONL parsingnumpy-- Statistical distribution calculationstiktoken-- Token counting withcl100k_baseencoding
Function Signature
def validate_json(data_path: str) -> None:
Parameters:
| Parameter | Type | Description |
|---|---|---|
data_path |
str |
Path to the JSONL file to validate |
Returns: None -- All output is printed to stdout.
Validation Stages
Stage 1: Load and Inspect
The function reads all lines from the JSONL file, parsing each as a JSON object:
with open(data_path) as f:
dataset = [json.loads(line) for line in f]
Prints the total number of examples and the first example for quick inspection.
Stage 2: Format Error Checks
Iterates over every example and checks for the following error categories:
| Error Key | Condition | Severity |
|---|---|---|
data_type |
Example is not a dict |
Fatal |
missing_messages_list |
No "messages" key or empty value |
Fatal |
message_missing_key |
A message lacks "role" or "content" |
Error |
message_unrecognized_key |
A message has keys other than "role", "content", "name" |
Warning |
unrecognized_role |
Role is not "system", "user", or "assistant" |
Error |
missing_content |
Content is None, empty, or not a string |
Error |
example_missing_assistant_message |
No message with "assistant" role in the example |
Error |
Errors are accumulated using a defaultdict(int) counter and printed as a summary.
Stage 3: Token Counting
Uses tiktoken.get_encoding("cl100k_base") to count tokens with two internal helper functions:
num_tokens_from_messages: Counts total tokens in a conversation, adding 3 tokens per message overhead and 1 token per name field. Also handles function_call values by converting them to strings before encoding.
num_assistant_tokens_from_messages: Counts only the tokens in assistant role messages, to analyze response length distribution.
Stage 4: Distribution Analysis
Computes and prints distributions for three metrics:
| Metric | Description |
|---|---|
num_messages_per_example |
How many messages are in each conversation |
num_total_tokens_per_example |
Total token count per conversation |
num_assistant_tokens_per_example |
Token count of assistant responses only |
Each distribution reports: min, max, mean, median, p5 (10th percentile), and p95 (90th percentile).
Also reports how many examples exceed the 4096-token training limit.
Stage 5: Cost Estimation
Estimates training costs using the following logic:
# Constants
MAX_TOKENS_PER_EXAMPLE = 4096
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_EPOCHS = 1
MAX_EPOCHS = 25
# Epoch auto-tuning
n_epochs = TARGET_EPOCHS
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)
# Billing calculation (tokens capped at 4096 per example)
n_billing_tokens_in_dataset = sum(
min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens
)
Reports: total billing tokens, default epoch count, total charged tokens, and per-epoch cost at $0.008/1K tokens.
Usage Example
from llama_index.finetuning.openai.validate_json import validate_json
# Validate training data before launching finetuning
validate_json("training_data.jsonl")
Sample Output:
Num examples: 50
First example:
{'role': 'system', 'content': 'You are a helpful assistant.'}
{'role': 'user', 'content': 'What is RAG?'}
{'role': 'assistant', 'content': 'RAG stands for...'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0
#### Distribution of num_messages_per_example:
min / max: 3, 5
mean / median: 3.2, 3.0
#### Distribution of num_total_tokens_per_example:
min / max: 45, 892
mean / median: 234.5, 198.0
Dataset has ~11725 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~35175 tokens
CLI Usage
The file can also be run directly from the command line:
# python validate_json.py <path_to_jsonl_file>
if __name__ == "__main__":
data_path = sys.argv[1]
if not os.path.exists(data_path):
raise ValueError(f"Path {data_path} does not exist")
validate_json(data_path)