Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Run llama Llama index Validate Json

From Leeroopedia

Overview

The validate_json function performs comprehensive validation of JSONL training data files before they are submitted to the OpenAI finetuning API. It checks structural format compliance, counts tokens, analyzes distributions, and estimates training costs. This utility is adapted from OpenAI's official dataset preparation guide.

Source File

  • File: llama-index-finetuning/llama_index/finetuning/openai/validate_json.py
  • Lines: 19-182
  • Import: from llama_index.finetuning.openai.validate_json import validate_json

Dependencies

  • json -- JSONL parsing
  • numpy -- Statistical distribution calculations
  • tiktoken -- Token counting with cl100k_base encoding

Function Signature

def validate_json(data_path: str) -> None:

Parameters:

Parameter Type Description
data_path str Path to the JSONL file to validate

Returns: None -- All output is printed to stdout.

Validation Stages

Stage 1: Load and Inspect

The function reads all lines from the JSONL file, parsing each as a JSON object:

with open(data_path) as f:
    dataset = [json.loads(line) for line in f]

Prints the total number of examples and the first example for quick inspection.

Stage 2: Format Error Checks

Iterates over every example and checks for the following error categories:

Error Key Condition Severity
data_type Example is not a dict Fatal
missing_messages_list No "messages" key or empty value Fatal
message_missing_key A message lacks "role" or "content" Error
message_unrecognized_key A message has keys other than "role", "content", "name" Warning
unrecognized_role Role is not "system", "user", or "assistant" Error
missing_content Content is None, empty, or not a string Error
example_missing_assistant_message No message with "assistant" role in the example Error

Errors are accumulated using a defaultdict(int) counter and printed as a summary.

Stage 3: Token Counting

Uses tiktoken.get_encoding("cl100k_base") to count tokens with two internal helper functions:

num_tokens_from_messages: Counts total tokens in a conversation, adding 3 tokens per message overhead and 1 token per name field. Also handles function_call values by converting them to strings before encoding.

num_assistant_tokens_from_messages: Counts only the tokens in assistant role messages, to analyze response length distribution.

Stage 4: Distribution Analysis

Computes and prints distributions for three metrics:

Metric Description
num_messages_per_example How many messages are in each conversation
num_total_tokens_per_example Total token count per conversation
num_assistant_tokens_per_example Token count of assistant responses only

Each distribution reports: min, max, mean, median, p5 (10th percentile), and p95 (90th percentile).

Also reports how many examples exceed the 4096-token training limit.

Stage 5: Cost Estimation

Estimates training costs using the following logic:

# Constants
MAX_TOKENS_PER_EXAMPLE = 4096
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_EPOCHS = 1
MAX_EPOCHS = 25

# Epoch auto-tuning
n_epochs = TARGET_EPOCHS
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

# Billing calculation (tokens capped at 4096 per example)
n_billing_tokens_in_dataset = sum(
    min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens
)

Reports: total billing tokens, default epoch count, total charged tokens, and per-epoch cost at $0.008/1K tokens.

Usage Example

from llama_index.finetuning.openai.validate_json import validate_json

# Validate training data before launching finetuning
validate_json("training_data.jsonl")

Sample Output:

Num examples: 50
First example:
{'role': 'system', 'content': 'You are a helpful assistant.'}
{'role': 'user', 'content': 'What is RAG?'}
{'role': 'assistant', 'content': 'RAG stands for...'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 5
mean / median: 3.2, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 45, 892
mean / median: 234.5, 198.0

Dataset has ~11725 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~35175 tokens

CLI Usage

The file can also be run directly from the command line:

# python validate_json.py <path_to_jsonl_file>
if __name__ == "__main__":
    data_path = sys.argv[1]
    if not os.path.exists(data_path):
        raise ValueError(f"Path {data_path} does not exist")
    validate_json(data_path)

Knowledge Sources

Metadata

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment