Principle:Lm sys FastChat SFT Data Preparation

Field	Value
Page Type	Principle
Title	SFT Data Preparation
Repository	lm-sys/FastChat
Workflow	Vicuna SFT Finetuning
Domains	Supervised Fine-Tuning, Data Engineering, NLP
Knowledge Sources	fastchat/train/train.py, ShareGPT dataset format, Vicuna training documentation
Last Updated	2026-02-07 14:00 GMT

Overview

This principle describes the theory and practice of preparing supervised fine-tuning (SFT) data for large language models. It covers the ShareGPT conversation format used by the Vicuna training pipeline, the distinction between eager and lazy data loading strategies, and the considerations that govern how raw conversation data is transformed into training-ready datasets.

Description

The ShareGPT Conversation Format

The Vicuna SFT pipeline expects training data in the ShareGPT conversation format, a JSON structure designed to represent multi-turn dialogues between a human user and a GPT-based assistant. The training file is a JSON list, and each element of that list is one training example (a single conversation) with the following structure:

[
  {
    "id": "unique_conversation_id",
    "conversations": [
      {"from": "human", "value": "What is the capital of France?"},
      {"from": "gpt", "value": "The capital of France is Paris."},
      {"from": "human", "value": "What is its population?"},
      {"from": "gpt", "value": "Paris has a population of approximately 2.1 million..."}
    ]
  }
]

Key structural requirements:

The top-level structure is a list of dictionaries, each representing one conversation.
Each dictionary contains an "id" field (string identifier) and a "conversations" field (list of turn dictionaries).
Each turn dictionary has a "from" field (either "human" or "gpt") and a "value" field (the text content).
Turns must alternate between "human" and "gpt" roles.
The first turn should be from "human". If it is not, the pipeline will skip the first turn to enforce this constraint.
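
The structural requirements above can be checked before training. The following is a minimal sketch, not the pipeline's own code: the helper name normalize_turns is hypothetical, and it only mirrors the two rules stated above (drop a leading non-human turn, then require strict alternation).

```python
from typing import Dict, List

Turn = Dict[str, str]


def normalize_turns(turns: List[Turn]) -> List[Turn]:
    """Drop a leading non-human turn, then verify strict human/gpt alternation.

    Hypothetical helper for illustration; it is not part of FastChat.
    """
    if turns and turns[0]["from"] != "human":
        turns = turns[1:]
    expected = ("human", "gpt")
    for index, turn in enumerate(turns):
        if turn["from"] != expected[index % 2]:
            raise ValueError(
                f"turn {index}: expected {expected[index % 2]!r}, got {turn['from']!r}"
            )
    return turns


example = [
    {"from": "gpt", "value": "Hello! How can I help?"},
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."},
]
cleaned = normalize_turns(example)
print([turn["from"] for turn in cleaned])  # prints ['human', 'gpt']
```

Running a check like this over the whole file surfaces malformed conversations at curation time rather than partway through a training run.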

Eager vs. Lazy Data Loading

The training pipeline supports two data loading strategies, each with distinct trade-offs:

Eager Loading (SupervisedDataset)

In eager mode, all data is preprocessed at initialization time:

The entire JSON file is loaded into memory.
All conversations are tokenized, padded, and target-masked in a single batch operation.
The resulting tensors (input_ids, labels, attention_mask) are stored in memory.
Advantages: Fast per-sample access during training; no repeated tokenization.
Disadvantages: High initial memory usage; long startup time for large datasets; the entire dataset must fit in memory as tensors.
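
The eager strategy can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: toy_tokenize is a stand-in for a real tokenizer, EagerToyDataset is not the SupervisedDataset class, the ignore label -100 is assumed (it is the usual PyTorch cross-entropy default), and the conversation template (system prompt, role separators) that a real pipeline applies is omitted.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumed ignore label; positions with this label add no loss
PAD_ID = 0           # assumed padding id for this toy example


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one non-zero id per whitespace-separated word."""
    return [1 + (sum(map(ord, word)) % 1000) for word in text.split()]


class EagerToyDataset:
    """Tokenizes, pads, and target-masks every conversation at construction time."""

    def __init__(self, conversations: List[List[Dict[str, str]]], max_length: int):
        self.input_ids: List[List[int]] = []
        self.labels: List[List[int]] = []
        self.attention_mask: List[List[int]] = []
        for turns in conversations:
            ids: List[int] = []
            labels: List[int] = []
            for turn in turns:
                tokens = toy_tokenize(turn["value"])
                ids.extend(tokens)
                # Only assistant ("gpt") tokens are kept as training targets.
                if turn["from"] == "gpt":
                    labels.extend(tokens)
                else:
                    labels.extend([IGNORE_INDEX] * len(tokens))
            ids, labels = ids[:max_length], labels[:max_length]
            pad = max_length - len(ids)
            self.attention_mask.append([1] * len(ids) + [0] * pad)
            self.input_ids.append(ids + [PAD_ID] * pad)
            self.labels.append(labels + [IGNORE_INDEX] * pad)

    def __len__(self) -> int:
        return len(self.input_ids)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        # No work happens here: everything was computed in __init__.
        return {
            "input_ids": self.input_ids[index],
            "labels": self.labels[index],
            "attention_mask": self.attention_mask[index],
        }


conversations = [[
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."},
]]
dataset = EagerToyDataset(conversations, max_length=16)
sample = dataset[0]
```

In this sketch the human tokens and the padding positions carry the ignore label, so only the assistant reply would contribute to the loss.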

Lazy Loading (LazySupervisedDataset)

In lazy mode, data is preprocessed on-demand:

The raw JSON data is loaded at initialization, but tokenization is deferred.
Each sample is tokenized the first time it is accessed via __getitem__.
Processed samples are cached in a dictionary (cached_data_dict) to avoid re-tokenization on subsequent accesses.
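Advantages: Low startup time; lower peak memory when only part of the dataset is accessed; better suited to very large datasets.
Disadvantages: The first access to each sample (typically the whole first epoch) is slower due to on-the-fly tokenization; the cache grows as samples are accessed, so memory use rises during training.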
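
The deferred-tokenization-plus-cache pattern can be sketched as follows. This is an illustrative toy, not the LazySupervisedDataset class itself: LazyToyDataset and counting_preprocess are hypothetical names, and only the cache attribute name cached_data_dict is taken from the description above.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]


class LazyToyDataset:
    """Keeps raw conversations and preprocesses each one on first access."""

    def __init__(self, raw: List[List[Turn]], preprocess: Callable[[List[Turn]], dict]):
        self.raw = raw                # raw JSON data, loaded up front
        self.preprocess = preprocess  # tokenization is deferred to __getitem__
        self.cached_data_dict: Dict[int, dict] = {}

    def __len__(self) -> int:
        return len(self.raw)

    def __getitem__(self, index: int) -> dict:
        if index in self.cached_data_dict:
            return self.cached_data_dict[index]  # cache hit: no re-tokenization
        sample = self.preprocess(self.raw[index])
        self.cached_data_dict[index] = sample
        return sample


calls: List[int] = []


def counting_preprocess(turns: List[Turn]) -> dict:
    """Hypothetical stand-in for tokenization that records each invocation."""
    calls.append(1)
    return {"num_words": sum(len(turn["value"].split()) for turn in turns)}


raw = [
    [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello there"}],
    [{"from": "human", "value": "Bye"}, {"from": "gpt", "value": "Goodbye"}],
]
lazy = LazyToyDataset(raw, counting_preprocess)
first = lazy[0]
again = lazy[0]  # served from the cache
```

After the two accesses above, the preprocessing function has run once and the cache holds one entry; the second conversation has not been processed at all.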

Data Quality Considerations

Effective SFT data preparation requires attention to:

Conversation coherence: Each multi-turn conversation should be logically consistent. The assistant responses should be relevant to the human queries.
Turn alternation: Strict alternation between human and gpt roles is enforced by the preprocessing pipeline.
Content diversity: The training data should cover a broad range of topics, instruction types, and response styles to produce a general-purpose assistant.
Length distribution: Conversations that exceed the model's maximum sequence length will be truncated, potentially losing important context. Understanding the length distribution of the data informs the choice of model_max_length.
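
The length-distribution point can be made concrete with a short report. This is a rough sketch under stated assumptions: truncation_report is a hypothetical helper, and whitespace word counts stand in for real token counts, which a real tokenizer (plus any conversation template) would make larger.

```python
import statistics
from typing import Callable, Dict, List

Turn = Dict[str, str]


def truncation_report(
    conversations: List[List[Turn]],
    model_max_length: int,
    count_tokens: Callable[[str], int],
) -> Dict[str, float]:
    """Summarize conversation lengths and the share that would be truncated."""
    lengths = [
        sum(count_tokens(turn["value"]) for turn in turns) for turns in conversations
    ]
    truncated = sum(1 for length in lengths if length > model_max_length)
    return {
        "max": max(lengths),
        "median": statistics.median(lengths),
        "truncated_fraction": truncated / len(lengths),
    }


def word_count(text: str) -> int:
    """Proxy for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def make_conversation(human_words: int, gpt_words: int) -> List[Turn]:
    """Builds a synthetic two-turn conversation of the requested sizes."""
    return [
        {"from": "human", "value": "word " * human_words},
        {"from": "gpt", "value": "word " * gpt_words},
    ]


conversations = [make_conversation(2, 2), make_conversation(5, 5), make_conversation(10, 20)]
report = truncation_report(conversations, model_max_length=16, count_tokens=word_count)
print(report)
```

With the synthetic data above, one of the three conversations exceeds the limit of 16, so a third of the data would lose its tail. On real data, a report like this helps decide whether to raise model_max_length or to split long conversations.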

Usage

When preparing data for Vicuna SFT fine-tuning:

Collect or curate conversations in the ShareGPT JSON format.
Validate that all conversations have alternating human/gpt turns.
Choose between eager and lazy loading based on dataset size and available memory.
Set the data_path argument to point to the training JSON file.
Optionally set eval_data_path for a held-out evaluation set.
Set lazy_preprocess=True for lazy loading on large datasets.
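
The last three steps amount to a small configuration decision, sketched below. The dataclass and function names are hypothetical and the file name train.json is a placeholder; only the field names data_path, eval_data_path, and lazy_preprocess and the two dataset class names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArgumentsSketch:
    """Illustrative container for the three data-related settings named above."""

    data_path: str
    eval_data_path: Optional[str] = None  # optional held-out evaluation set
    lazy_preprocess: bool = False         # False selects eager loading


def choose_dataset_name(args: DataArgumentsSketch) -> str:
    """Return which dataset strategy the settings would select."""
    return "LazySupervisedDataset" if args.lazy_preprocess else "SupervisedDataset"


args = DataArgumentsSketch(data_path="train.json", lazy_preprocess=True)
print(choose_dataset_name(args))  # prints LazySupervisedDataset
```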

Theoretical Basis

Supervised fine-tuning (SFT) is the process of adapting a pre-trained language model to follow instructions by training on curated (prompt, response) pairs. The theoretical foundation rests on:

Transfer learning: The pre-trained model already encodes broad language understanding; SFT teaches it to apply that understanding in a conversational instruction-following format.
Behavioral cloning: SFT is a form of imitation learning where the model learns to replicate the behavior demonstrated in the training conversations.
Multi-turn context: Including full conversation histories (not just single-turn pairs) teaches the model to maintain coherence across multiple exchanges, a critical capability for interactive assistants.

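The choice between eager and lazy loading reflects a classic time-space trade-off in data engineering: eager loading pays the full tokenization time and memory cost up front in exchange for fast per-sample access, while lazy loading defers that cost to the first access of each sample, giving a fast start and a memory footprint that grows only as samples are used.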

Related Pages

Implemented by: Implementation:Lm_sys_FastChat_Make_Supervised_Data_Module
