Principle:Lm sys FastChat SFT Data Preparation
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | SFT Data Preparation |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | Supervised Fine-Tuning, Data Engineering, NLP |
| Knowledge Sources | fastchat/train/train.py, ShareGPT dataset format, Vicuna training documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle describes the theory and practices for preparing supervised fine-tuning (SFT) data for large language models. It covers the ShareGPT conversation format used by the Vicuna training pipeline, the distinction between eager and lazy data loading strategies, and the considerations that govern how raw conversation data is transformed into training-ready datasets.
Description
The Vicuna SFT pipeline expects training data in the ShareGPT conversation format, a JSON structure designed to represent multi-turn dialogues between a human user and a GPT-based assistant. The training file is a JSON list in which each element is one training example with the following structure:
[
  {
    "id": "unique_conversation_id",
    "conversations": [
      {"from": "human", "value": "What is the capital of France?"},
      {"from": "gpt", "value": "The capital of France is Paris."},
      {"from": "human", "value": "What is its population?"},
      {"from": "gpt", "value": "Paris has a population of approximately 2.1 million..."}
    ]
  }
]
Key structural requirements:
- The top-level structure is a list of dictionaries, each representing one conversation.
- Each dictionary contains an "id" field (string identifier) and a "conversations" field (list of turn dictionaries).
- Each turn dictionary has a "from" field (either "human" or "gpt") and a "value" field (the text content).
- Turns must alternate between "human" and "gpt" roles.
- The first turn should be from "human". If it is not, the pipeline skips the first turn to enforce this constraint.
Eager vs. Lazy Data Loading
The training pipeline supports two data loading strategies, each with distinct trade-offs:
Eager Loading (SupervisedDataset)
In eager mode, all data is preprocessed at initialization time:
- The entire JSON file is loaded into memory.
- All conversations are tokenized, padded, and target-masked in a single batch operation.
- The resulting tensors (input_ids, labels, attention_mask) are stored in memory.
- Advantages: Fast per-sample access during training; no repeated tokenization.
- Disadvantages: High initial memory usage; long startup time for large datasets; the entire dataset must fit in memory as tensors.
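The eager pattern can be illustrated without any deep-learning dependencies. The sketch below is an assumption-laden toy, not the SupervisedDataset implementation: EagerDataset and toy_tokenize are hypothetical names, and the "tokenizer" simply maps each whitespace-separated word to its length.

```python
from typing import Dict, List

PAD_ID = 0  # toy padding id; real tokenizers define their own


def toy_tokenize(conversation: List[Dict[str, str]], max_length: int) -> Dict[str, List[int]]:
    """Hypothetical stand-in for real tokenization: one id per word."""
    words = " ".join(turn["value"] for turn in conversation).split()
    ids = [len(word) for word in words][:max_length]         # truncate
    mask = [1] * len(ids) + [0] * (max_length - len(ids))    # 1 marks a real token
    ids = ids + [PAD_ID] * (max_length - len(ids))           # pad to a fixed length
    return {"input_ids": ids, "attention_mask": mask}


class EagerDataset:
    """Tokenizes every conversation once, at construction time."""

    def __init__(self, raw_data: List[dict], max_length: int = 8) -> None:
        # All preprocessing cost is paid here, before training starts.
        self.samples = [
            toy_tokenize(example["conversations"], max_length) for example in raw_data
        ]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        # Pure lookup: no tokenization happens at access time.
        return self.samples[index]
```

The cost profile matches the trade-offs listed above: construction scales with the full dataset, while each later access is a constant-time lookup.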
Lazy Loading (LazySupervisedDataset)
In lazy mode, data is preprocessed on-demand:
- The raw JSON data is loaded at initialization, but tokenization is deferred.
- Each sample is tokenized the first time it is accessed via __getitem__.
- Processed samples are cached in a dictionary (cached_data_dict) to avoid re-tokenization on subsequent accesses.
- Advantages: Low startup time; lower peak memory when only part of the dataset is accessed; better suited to very large datasets.
- Disadvantages: First-epoch access is slower due to on-the-fly tokenization; the cache grows as samples are accessed and eventually holds every processed sample.
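The cache-on-first-access pattern described above can be sketched in a few lines. This is an illustrative toy rather than the LazySupervisedDataset code: the class name LazyDataset, the tokenize callable, and the tokenize_calls counter are all hypothetical and exist only to make the behavior observable.

```python
from typing import Any, Callable, Dict, List


class LazyDataset:
    """Defers tokenization until a sample is first requested."""

    def __init__(self, raw_data: List[dict], tokenize: Callable[[List[dict]], Any]) -> None:
        self.raw_data = raw_data          # raw JSON stays in memory
        self.tokenize = tokenize          # hypothetical tokenization callable
        self.cache: Dict[int, Any] = {}   # index -> processed sample
        self.tokenize_calls = 0           # instrumentation for this sketch only

    def __len__(self) -> int:
        return len(self.raw_data)

    def __getitem__(self, index: int) -> Any:
        # Serve from the cache when the sample was processed before.
        if index in self.cache:
            return self.cache[index]
        # First access: tokenize now and remember the result.
        self.tokenize_calls += 1
        sample = self.tokenize(self.raw_data[index]["conversations"])
        self.cache[index] = sample
        return sample
```

Because the cache in this sketch is an unbounded dictionary, memory use after a full pass over the data approaches that of the eager strategy; the saving is in startup time and in runs that touch only part of the dataset.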
Data Quality Considerations
Effective SFT data preparation requires attention to:
- Conversation coherence: Each multi-turn conversation should be logically consistent. The assistant responses should be relevant to the human queries.
- Turn alternation: Strict alternation between human and gpt roles is enforced by the preprocessing pipeline.
- Content diversity: The training data should cover a broad range of topics, instruction types, and response styles to produce a general-purpose assistant.
- Length distribution: Conversations that exceed the model's maximum sequence length will be truncated, potentially losing important context. Understanding the length distribution of the data informs the choice of model_max_length.
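One way to examine the length distribution is to compute percentiles and the share of conversations that a candidate limit would truncate. The sketch below is a rough aid under stated assumptions: the function names are hypothetical, count_tokens is whatever token counter the caller supplies, and the totals ignore any system prompt, role tags, or separators that a conversation template adds, so they should be read as lower bounds.

```python
import math
from typing import Callable, Dict, List, Sequence


def conversation_lengths(raw_data: List[dict], count_tokens: Callable[[str], int]) -> List[int]:
    """Total token count per conversation, summed over all turn values."""
    return [
        sum(count_tokens(turn["value"]) for turn in example["conversations"])
        for example in raw_data
    ]


def length_report(
    lengths: Sequence[int],
    model_max_length: int,
    percentiles: Sequence[int] = (50, 90, 99),
) -> Dict[str, float]:
    """Nearest-rank percentiles plus the fraction exceeding model_max_length."""
    if not lengths:
        raise ValueError("no conversations to measure")
    ordered = sorted(lengths)
    report: Dict[str, float] = {"max": ordered[-1]}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank method
        report[f"p{p}"] = ordered[rank - 1]
    too_long = sum(1 for n in ordered if n > model_max_length)
    report["truncated_fraction"] = too_long / len(ordered)
    return report
```

A high truncated_fraction for a candidate limit suggests either raising model_max_length, if memory allows, or splitting long conversations before training.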
Usage
When preparing data for Vicuna SFT fine-tuning:
- Collect or curate conversations in the ShareGPT JSON format.
- Validate that all conversations have alternating human/gpt turns.
- Choose between eager and lazy loading based on dataset size and available memory.
- Set the data_path argument to point to the training JSON file.
- Optionally set eval_data_path for a held-out evaluation set.
- Set lazy_preprocess=True for lazy loading on large datasets.
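The three arguments named above can be pictured as a small configuration object. The following sketch is illustrative only: DataArgs and load_splits are hypothetical names, the defaults shown are assumptions made for this example, and the returned strategy label stands in for whichever dataset class a real pipeline would instantiate.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DataArgs:
    """Hypothetical container for the data-related arguments."""
    data_path: str                         # training JSON file
    eval_data_path: Optional[str] = None   # optional held-out set
    lazy_preprocess: bool = False          # assumed default for this sketch


def load_splits(args: DataArgs) -> Dict[str, Any]:
    """Load the JSON files and report which loading strategy was selected."""
    with open(args.data_path, "r", encoding="utf-8") as handle:
        train = json.load(handle)
    eval_data = None
    if args.eval_data_path:
        with open(args.eval_data_path, "r", encoding="utf-8") as handle:
            eval_data = json.load(handle)
    strategy = "lazy" if args.lazy_preprocess else "eager"
    return {"train": train, "eval": eval_data, "strategy": strategy}
```

Keeping the strategy choice behind a single boolean means the same data files can be tried with either loading mode without any other change.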
Theoretical Basis
Supervised fine-tuning (SFT) is the process of adapting a pre-trained language model to follow instructions by training on curated (prompt, response) pairs. The theoretical foundation rests on:
- Transfer learning: The pre-trained model already encodes broad language understanding; SFT teaches it to apply that understanding in a conversational instruction-following format.
- Behavioral cloning: SFT is a form of imitation learning where the model learns to replicate the behavior demonstrated in the training conversations.
- Multi-turn context: Including full conversation histories (not just single-turn pairs) teaches the model to maintain coherence across multiple exchanges, a critical capability for interactive assistants.
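In practice, imitating only the assistant's behavior is commonly achieved by masking the training labels so that loss is computed on assistant tokens while human tokens still serve as context. The sketch below illustrates that idea in simplified form and is not the pipeline's actual masking routine: build_labels and the tokenize callable are hypothetical, role tags and separators are omitted, and -100 is used because it is the default ignore_index of PyTorch's cross-entropy loss.

```python
from typing import Callable, Dict, List, Tuple

IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's default ignore_index)


def build_labels(
    turns: List[Dict[str, str]],
    tokenize: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Concatenate all turns into input_ids and mask non-assistant labels."""
    input_ids: List[int] = []
    labels: List[int] = []
    for turn in turns:
        ids = tokenize(turn["value"])
        input_ids.extend(ids)
        if turn["from"] == "gpt":
            # Assistant tokens are the targets the model should reproduce.
            labels.extend(ids)
        else:
            # Human tokens provide context but contribute no loss.
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels
```

Because every turn stays in input_ids, later assistant responses are conditioned on the full preceding history, which is how multi-turn context enters training.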
The choice between eager and lazy loading reflects a classic time-space trade-off in data engineering: eager loading trades memory for speed, while lazy loading trades speed for memory efficiency.