Heuristic: openai-python Fine-Tuning Data Preparation Tips
| Knowledge Sources | |
|---|---|
| Domains | Fine_Tuning, Data_Preparation |
| Last Updated | 2026-02-15 10:00 GMT |
Overview
Built-in training data validation tips from the SDK's validators framework: use at least 100 examples, add consistent separators, start completions with whitespace, and keep examples under 2048 tokens.
Description
The `lib/_validators.py` module contains a comprehensive validation and remediation framework for fine-tuning training data. It checks JSONL format correctness, column naming, prompt/completion formatting, and suggests fixes for common issues. The validators encode practical wisdom about what makes fine-tuning data effective, drawn from OpenAI's experience with thousands of fine-tuning jobs.
Usage
Apply these tips when preparing training data for the fine-tuning workflow. The SDK's `openai tools fine_tunes.prepare_data` CLI command runs these validators automatically, but understanding the rules helps you prepare better data upfront and avoid remediation cycles.
The Insight (Rule of Thumb)
- Minimum examples: Use at least 100 prompt-completion pairs. Performance tends to linearly increase for every doubling of the number of examples.
- Column naming: Column names must be lowercase (`prompt`, `completion`). The validator will auto-fix case mismatches.
- Prompt separator: Add a consistent separator at the end of all prompts (e.g., `\n\n###\n\n`). This helps the model distinguish prompt from completion.
- Completion whitespace: Completions should start with a whitespace character (space). This improves tokenization quality.
- Completion ending: Add a common ending string to all completions (e.g., `###` or `END`). This teaches the model when to stop generating.
- Token limit: Keep examples under 2048 tokens for conditional generation and classification tasks.
- No instructions in prompts: Do not include task descriptions or few-shot examples in the fine-tuning prompts — put only the input. The model learns the task from the pattern.
- Remove common prefixes: Strip common prefixes from completions to avoid the model learning redundant patterns.
- Duplicate detection: The validator warns about duplicate column names.
- Excel files: Only the first sheet of Excel files is read; use single-sheet files or convert to CSV.
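Taken together, the formatting rules above can be sketched as a small preparation step. This is an illustrative sketch, not the SDK's own code; the separator and stop strings are the example values mentioned in the list:

```python
import json

SEPARATOR = "\n\n###\n\n"  # consistent separator at the end of every prompt (example value)
STOP = " END"              # common ending string for every completion (example value)

def format_example(raw_input: str, raw_output: str) -> dict:
    """Format one prompt-completion pair per the rules of thumb."""
    return {
        # The prompt holds only the input (no instructions), ending with the separator.
        "prompt": raw_input.strip() + SEPARATOR,
        # The completion starts with a space and ends with the stop string.
        "completion": " " + raw_output.strip() + STOP,
    }

def write_jsonl(pairs, path):
    """Write (input, output) pairs as JSONL training data."""
    with open(path, "w", encoding="utf-8") as f:
        for raw_in, raw_out in pairs:
            f.write(json.dumps(format_example(raw_in, raw_out)) + "\n")

example = format_example("Great movie!", "positive")
# example["prompt"]     == "Great movie!\n\n###\n\n"
# example["completion"] == " positive END"
```

At inference time the same separator is appended to the prompt, and the stop string is passed as the `stop` parameter so the model halts where training taught it to.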
Reasoning
These rules are derived from empirical observations across many fine-tuning jobs:
- 100+ examples: Smaller datasets lead to overfitting and poor generalization. Note that "linearly increase for every doubling" means each doubling adds a roughly constant performance increment: going from 100 to 200 examples yields about the same gain as going from 200 to 400, not a multiplicative improvement.
- Separators: Without consistent separators, the model has difficulty learning where prompts end and completions begin, leading to poor completion quality.
- Leading whitespace: GPT tokenizers treat ` hello` differently from `hello`. Starting with a space ensures the first token of the completion is correctly tokenized.
- 2048 token limit: Longer examples may be truncated during training, leading to incomplete learning signals.
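A cheap pre-check against the 2048-token budget can be done before upload. The sketch below uses the rough rule of thumb of about 4 characters per English token; for accurate counts, use a real tokenizer such as tiktoken:

```python
def rough_token_count(text: str) -> int:
    # Crude approximation: ~4 characters per token for English text.
    # Replace with a real tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

def over_limit(prompt: str, completion: str, limit: int = 2048) -> bool:
    """Flag examples whose combined length likely exceeds the token limit."""
    return rough_token_count(prompt) + rough_token_count(completion) > limit

# A 10,000-character prompt blows well past the 2048-token budget:
# over_limit("x" * 10_000, " short answer") -> True
```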
Code evidence from `lib/_validators.py:25-36`:

```python
def num_examples_validator(df):
    MIN_EXAMPLES = 100
    optional_suggestion = (
        ""
        if len(df) >= MIN_EXAMPLES
        else ". In general, we recommend having at least a few hundred examples. "
        "We've found that performance tends to linearly increase for every "
        "doubling of the number of examples"
    )
    immediate_msg = f"\n- Your file contains {len(df)} prompt-completion pairs{optional_suggestion}"
```
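The message logic above depends on the DataFrame only through `len(df)`, so it can be exercised with a simplified stand-in that accepts any sized collection. This is a rewritten sketch for illustration, not the SDK's function:

```python
MIN_EXAMPLES = 100

def num_examples_message(examples) -> str:
    # Mirrors the validator's message construction; `examples` may be any
    # sized collection, since only len() is used.
    optional_suggestion = (
        ""
        if len(examples) >= MIN_EXAMPLES
        else ". In general, we recommend having at least a few hundred examples."
    )
    return f"\n- Your file contains {len(examples)} prompt-completion pairs{optional_suggestion}"

# A 42-example dataset triggers the suggestion; a 200-example one does not.
small = num_examples_message(range(42))
large = num_examples_message(range(200))
```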
Column naming validation from `lib/_validators.py:61`:

```python
immediate_msg = f"\n- The `{necessary_column}` column/key should be lowercase"
```
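The auto-fix for case mismatches can be sketched on plain JSONL records. The SDK operates on a pandas DataFrame; this stand-in lowercases dictionary keys directly:

```python
def lowercase_keys(record: dict) -> dict:
    """Auto-fix case mismatches such as "Prompt" -> "prompt"."""
    return {key.lower(): value for key, value in record.items()}

fixed = lowercase_keys({"Prompt": "Great movie!", "Completion": " positive"})
# fixed == {"prompt": "Great movie!", "completion": " positive"}
```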