Heuristic: OpenAI Python Fine-Tuning Data Preparation Tips

From Leeroopedia
Knowledge Sources
Domains Fine_Tuning, Data_Preparation
Last Updated 2026-02-15 10:00 GMT

Overview

Built-in training data validation tips from the SDK's validators framework: use at least 100 examples, add consistent separators, start completions with whitespace, and keep examples under 2048 tokens.

Description

The `lib/_validators.py` module contains a comprehensive validation and remediation framework for fine-tuning training data. It checks JSONL format correctness, column naming, prompt/completion formatting, and suggests fixes for common issues. The validators encode practical wisdom about what makes fine-tuning data effective, drawn from OpenAI's experience with thousands of fine-tuning jobs.

Usage

Apply these tips when preparing training data for the fine-tuning workflow. The SDK's `openai tools fine_tunes.prepare_data` CLI command runs these validators automatically, but understanding the rules helps you prepare better data upfront and avoid remediation cycles.
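Before running the CLI, the training file must be valid JSONL with one `prompt`/`completion` object per line. A minimal sketch of producing such a file (the filename and example data are hypothetical):

```python
import json

# Hypothetical training pairs; the lowercase keys `prompt` and
# `completion` are the column names the validators expect.
pairs = [
    {"prompt": "Translate to French: cat", "completion": "chat"},
    {"prompt": "Translate to French: dog", "completion": "chien"},
]

def write_jsonl(pairs, path):
    """Write prompt-completion pairs as one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

write_jsonl(pairs, "train.jsonl")
```

A file in this shape can then be handed to `openai tools fine_tunes.prepare_data`, which will apply the remaining checks and suggest fixes.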

The Insight (Rule of Thumb)

  • Minimum examples: Use at least 100 prompt-completion pairs; a few hundred is better. Performance tends to increase linearly with each doubling of the number of examples.
  • Column naming: Column names must be lowercase (`prompt`, `completion`). The validator will auto-fix case mismatches.
  • Prompt separator: Add a consistent separator at the end of all prompts (e.g., `\n\n###\n\n`). This helps the model distinguish prompt from completion.
  • Completion whitespace: Completions should start with a whitespace character (space). This improves tokenization quality.
  • Completion ending: Add a common ending string to all completions (e.g., `###` or `END`). This teaches the model when to stop generating.
  • Token limit: Keep examples under 2048 tokens for conditional generation and classification tasks.
  • No instructions in prompts: Do not include task descriptions or few-shot examples in the fine-tuning prompts — put only the input. The model learns the task from the pattern.
  • Remove common prefixes: Strip common prefixes from completions to avoid the model learning redundant patterns.
  • Duplicate detection: The validator warns about duplicate column names.
  • Excel files: Only the first sheet of Excel files is read; use single-sheet files or convert to CSV.
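The separator, leading-whitespace, and ending-string rules above can be applied mechanically before validation. A minimal sketch (the separator and stop string are example choices, not SDK constants):

```python
SEPARATOR = "\n\n###\n\n"  # consistent boundary appended to every prompt
STOP = " END"              # common ending string appended to every completion

def format_example(prompt, completion):
    """Apply the separator, leading-space, and stop-string rules to one pair."""
    prompt = prompt.rstrip() + SEPARATOR
    completion = " " + completion.strip() + STOP  # leading space aids tokenization
    return {"prompt": prompt, "completion": completion}

example = format_example("Classify sentiment: great movie!", "positive")
```

Whatever separator and stop string you choose, the key point is consistency: the same strings must appear in every training example and again at inference time.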

Reasoning

These rules are derived from empirical observations across many fine-tuning jobs:

  • 100+ examples: Smaller datasets lead to overfitting and poor generalization. The scaling relationship means each doubling of the dataset (e.g. 100 → 200 examples) yields a roughly constant performance gain, so a few hundred examples is a practical target.
  • Separators: Without consistent separators, the model has difficulty learning where prompts end and completions begin, leading to poor completion quality.
  • Leading whitespace: GPT byte-pair tokenizers fold a leading space into the following token, so ` hello` and `hello` encode differently. Starting the completion with a space matches how words normally appear mid-text and ensures the first token is tokenized the same way the model saw it during pretraining.
  • 2048 token limit: Longer examples may be truncated during training, leading to incomplete learning signals.
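To screen for the 2048-token limit without loading a tokenizer, a rough character-based estimate is often enough for a first pass. A sketch using the common ~4-characters-per-token heuristic for English text (an approximation; a real check would use the model's tokenizer, e.g. tiktoken):

```python
MAX_TOKENS = 2048

def rough_token_count(text):
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def over_limit(prompt, completion, limit=MAX_TOKENS):
    """Flag a pair whose combined estimated length exceeds the limit."""
    return rough_token_count(prompt) + rough_token_count(completion) > limit
```

Pairs flagged by this estimate should be re-checked with an exact tokenizer before trimming, since the heuristic can be off for code, non-English text, or unusual formatting.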

Code evidence from `lib/_validators.py:25-36`:

def num_examples_validator(df):
    MIN_EXAMPLES = 100
    optional_suggestion = (
        ""
        if len(df) >= MIN_EXAMPLES
        else ". In general, we recommend having at least a few hundred examples. "
             "We've found that performance tends to linearly increase for every "
             "doubling of the number of examples"
    )
    immediate_msg = f"\n- Your file contains {len(df)} prompt-completion pairs{optional_suggestion}"
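The SDK function above operates on a pandas DataFrame and is truncated here; its message logic can be mirrored on a plain list of dicts. A standalone sketch (not the SDK's implementation):

```python
MIN_EXAMPLES = 100

def check_num_examples(examples):
    """Mirror the validator's message for a plain list of prompt-completion dicts."""
    suggestion = (
        ""
        if len(examples) >= MIN_EXAMPLES
        else ". In general, we recommend having at least a few hundred examples"
    )
    return f"\n- Your file contains {len(examples)} prompt-completion pairs{suggestion}"
```

Below the threshold the message carries the "few hundred examples" suggestion; at or above it, only the count is reported.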

Column naming validation from `lib/_validators.py:61`:

immediate_msg = f"\n- The `{necessary_column}` column/key should be lowercase"
