Principle:BerriAI Litellm Training Data Preparation

From Leeroopedia
Knowledge Sources: LLM Fine-Tuning Best Practices, OpenAI Fine-Tuning Guide
Domains: Machine Learning, Natural Language Processing, Data Engineering
Last Updated: 2026-02-15

Overview

Training data preparation is the process of structuring and formatting domain-specific examples into a schema that fine-tuning APIs can consume to customize large language model behavior.

Description

Fine-tuning a large language model requires a carefully curated dataset that teaches the model the desired input-output behavior for a specific domain or task. Training data preparation encompasses the entire pipeline from raw examples to a validated, upload-ready file. The core challenge is converting human knowledge -- example conversations, question-answer pairs, or instruction-response sequences -- into a structured format that the fine-tuning engine can parse and learn from.

The most widely used format among LLM providers is JSONL (JSON Lines), where each line is a self-contained JSON object representing a single training example. Each example typically follows a chat completion schema: an array of message objects with defined roles (system, user, assistant) that together represent one complete conversation, which may span one or more turns. This format mirrors the inference-time input schema, ensuring consistency between training and deployment.

Beyond formatting, preparation also involves defining hyperparameters that control the training process itself -- batch size, learning rate multiplier, and number of training epochs. These parameters govern convergence speed, overfitting risk, and overall model quality.

Usage

Training data preparation should be applied whenever:

  • A base model needs to be adapted to a specific domain vocabulary, tone, or task structure.
  • Consistent output formatting is required that cannot be achieved through prompting alone.
  • A set of high-quality example interactions exists that demonstrates the desired model behavior.
  • Hyperparameters need to be tuned to balance training cost, speed, and quality.

Theoretical Basis

Data Format Theory

The JSONL format ensures that each training example is independently parseable and streamable, allowing large datasets to be processed without loading the entire file into memory. Each line must be a valid JSON object conforming to the chat completion schema:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

The role field segments the conversation into context-setting (system), input (user), and desired output (assistant) components. The model learns to predict the assistant turns given the preceding context.
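As an illustration of the schema above, the following is a minimal sketch of a per-line validator. The function name `validate_line`, the `ALLOWED_ROLES` set, and the specific checks are assumptions made for this example; individual providers may accept additional roles or fields and enforce stricter rules.

```python
import json

# Roles named in the schema description above; providers may allow others.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    has_assistant = False
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {role!r}")
        elif role == "assistant":
            has_assistant = True
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} content is not a string")

    # Without an assistant turn there is no target output to learn from.
    if not has_assistant:
        problems.append("no assistant message to learn from")
    return problems
```

Because each line is checked independently, a file can be validated by streaming it line by line and collecting the line numbers that return a non-empty list.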

Hyperparameter Theory

Fine-tuning hyperparameters control the optimization trajectory:

  • batch_size: The number of training examples processed per gradient update. Larger batches reduce variance in gradient estimates but require more memory. Setting this to "auto" lets the provider choose a value for the dataset.
  • learning_rate_multiplier: A scaling factor applied to the base learning rate. Higher values accelerate adaptation but risk overshooting good weights. Lower values give more stable but slower convergence.
  • n_epochs: The total number of passes through the training dataset. More epochs let the model see each example multiple times, which tightens the fit to the training data but increases overfitting risk.
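To make the interaction between batch_size and n_epochs concrete, the sketch below counts gradient updates for a run. It assumes that every epoch processes each example exactly once and that a final partial batch still counts as one update; the helper name `total_update_steps` is hypothetical, and real providers may batch differently.

```python
import math


def total_update_steps(n_examples: int, batch_size: int, n_epochs: int) -> int:
    """Number of gradient updates under the simple batching assumption."""
    if n_examples <= 0 or batch_size <= 0 or n_epochs <= 0:
        raise ValueError("all arguments must be positive integers")
    # A trailing partial batch is counted as a full update step.
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs


# 1000 examples, 3 epochs: doubling the batch size roughly halves the updates.
print(total_update_steps(1000, 8, 3))   # 375
print(total_update_steps(1000, 16, 3))  # 189
```

Under this assumption, a larger batch means fewer but less noisy updates for the same number of epochs, which is one reason batch size and learning rate multiplier are usually adjusted together.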

Pseudocode: Training Data Pipeline

FUNCTION prepare_training_data(raw_examples):
    formatted_lines = []
    FOR each example IN raw_examples:
        messages = []
        IF example has system_prompt:
            messages.APPEND({"role": "system", "content": example.system_prompt})
        messages.APPEND({"role": "user", "content": example.user_input})
        messages.APPEND({"role": "assistant", "content": example.desired_output})
        formatted_lines.APPEND(JSON_serialize({"messages": messages}))
    WRITE formatted_lines to file as JSONL
    RETURN file_path

FUNCTION define_hyperparameters(batch_size, learning_rate, epochs):
    params = {}
    IF batch_size is not None:
        params["batch_size"] = batch_size
    IF learning_rate is not None:
        params["learning_rate_multiplier"] = learning_rate
    IF epochs is not None:
        params["n_epochs"] = epochs
    RETURN params
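The pseudocode above could be rendered in Python roughly as follows. This is a sketch, not a reference implementation: it assumes each raw example is a dict with the keys `system_prompt` (optional), `user_input`, and `desired_output`, mirroring the field names in the pseudocode, and it takes the output path as an explicit `file_path` argument.

```python
import json
from pathlib import Path


def prepare_training_data(raw_examples, file_path):
    """Write raw examples to a JSONL file in chat completion format."""
    lines = []
    for example in raw_examples:
        messages = []
        # The system prompt is optional; skip it when missing or empty.
        if example.get("system_prompt"):
            messages.append({"role": "system", "content": example["system_prompt"]})
        messages.append({"role": "user", "content": example["user_input"]})
        messages.append({"role": "assistant", "content": example["desired_output"]})
        # json.dumps never emits raw newlines, so each example stays on one line.
        lines.append(json.dumps({"messages": messages}, ensure_ascii=False))
    path = Path(file_path)
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return path


def define_hyperparameters(batch_size=None, learning_rate=None, epochs=None):
    """Build a hyperparameter dict, omitting any value left unset."""
    candidates = {
        "batch_size": batch_size,
        "learning_rate_multiplier": learning_rate,
        "n_epochs": epochs,
    }
    return {name: value for name, value in candidates.items() if value is not None}
```

Omitting unset keys, rather than sending explicit nulls, leaves the provider free to apply its own defaults for anything the caller did not specify.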

Validation File

An optional validation file follows the same JSONL format but contains held-out examples that are not used during training. The fine-tuning engine periodically evaluates the model on this set to measure generalization and detect overfitting. A common rule of thumb is to hold out 10-20% of the total examples for validation.
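One way to produce the held-out set is a seeded random split, sketched below. The function name `split_examples`, the default fraction, and the fixed seed are illustrative assumptions; datasets with near-duplicate or grouped examples may need a grouped split instead so that related examples do not land on both sides.

```python
import random


def split_examples(examples, validation_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)
    if len(shuffled) < 2:
        raise ValueError("need at least two examples to split")
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    # Keep at least one example on each side of the split.
    n_val = min(len(shuffled) - 1, max(1, n_val))
    return shuffled[n_val:], shuffled[:n_val]
```

Each returned list can then be written to its own JSONL file, with the training file and validation file uploaded separately.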
