Principle:Lm_sys_FastChat_Distributed_SFT_Training
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Distributed SFT Training |
| Repository | lm-sys/FastChat |
| Workflow | Vicuna SFT Finetuning |
| Domains | Distributed Training, FSDP, Mixed Precision, Gradient Accumulation |
| Knowledge Sources | fastchat/train/train.py, PyTorch FSDP documentation, Hugging Face Trainer documentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle covers the theory and practices of distributed training for supervised fine-tuning (SFT) of large language models. It addresses Fully Sharded Data Parallel (FSDP) as the primary distribution strategy, along with gradient accumulation, mixed precision training, checkpoint resumption, and learning rate scheduling -- all of which are essential for training models at the scale of Vicuna (7B-33B parameters).
Description
Fully Sharded Data Parallel (FSDP)
FSDP is PyTorch's native implementation of the ZeRO (Zero Redundancy Optimizer) paradigm. Unlike standard Data Parallel (DP) or Distributed Data Parallel (DDP) which replicate the full model on each GPU, FSDP shards model parameters, gradients, and optimizer states across participating GPUs:
- Parameter sharding: Each GPU holds only a fraction of the model's parameters. Parameters are gathered (all-gathered) on demand when needed for forward or backward passes, then re-sharded after use.
- Gradient sharding: Gradients are reduced and scattered so that each GPU only stores gradients for its own parameter shard.
- Optimizer state sharding: Optimizer states (e.g., Adam momentum and variance) are split across GPUs, dramatically reducing per-GPU memory.
This approach enables training of models that would not fit on a single GPU, while maintaining near-linear scaling efficiency.
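The memory savings can be sketched with back-of-envelope arithmetic. The sketch below assumes fp16 parameters and gradients with fp32 Adam moments (a common setup, not taken from the FastChat source), and ignores activations and any fp32 master weights:

```python
def fsdp_state_gib(num_params: float, num_gpus: int) -> float:
    """Per-GPU memory (GiB) for sharded parameters, gradients, and Adam states.

    Assumes fp16 parameters/gradients and fp32 Adam moments; activations
    and fp32 master copies are not counted (illustrative only).
    """
    bytes_per_param = (
        2      # fp16 parameter shard
        + 2    # fp16 gradient shard
        + 4    # fp32 Adam first moment
        + 4    # fp32 Adam second moment
    )
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 1024**3

# A 7B-parameter model under these assumptions: ~78 GiB of training state
# in total, far beyond a single 80 GB GPU once activations are added, but
# under 10 GiB per GPU when fully sharded across 8 GPUs.
unsharded = fsdp_state_gib(7e9, 1)
sharded_8 = fsdp_state_gib(7e9, 8)
```

This is the O(model_size / num_gpus) scaling described in the Theoretical Basis section below: state memory divides linearly by the number of participating GPUs.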
Gradient Accumulation
When the desired effective batch size exceeds what can fit in GPU memory (even with FSDP), gradient accumulation simulates a larger batch by:
- Running multiple forward-backward passes with smaller micro-batches.
- Accumulating gradients across these micro-batches without updating model parameters.
- Performing a single optimizer step after accumulating the desired number of micro-batches.
The effective batch size is: effective_batch = per_device_batch * num_devices * gradient_accumulation_steps.
Gradient accumulation is orthogonal to FSDP and can be combined with it to achieve very large effective batch sizes while keeping per-device memory usage manageable.
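The equivalence between accumulated micro-batch gradients and a single full-batch gradient can be verified directly. This framework-free sketch uses a one-parameter least-squares model; the mean-loss scaling by the number of accumulation steps mirrors what training loops do before each backward pass:

```python
def grad(w, xs, ys):
    """Gradient of the mean squared error loss mean((w*x - y)^2) w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One pass over the full batch.
full = grad(w, xs, ys)

# Two micro-batches, each gradient scaled by 1/accum_steps, summed
# without any parameter update in between.
accum_steps = 2
micro = len(xs) // accum_steps
acc = 0.0
for i in range(accum_steps):
    mb_x = xs[i * micro:(i + 1) * micro]
    mb_y = ys[i * micro:(i + 1) * micro]
    acc += grad(w, mb_x, mb_y) / accum_steps

# The accumulated gradient equals the full-batch gradient, because
# gradient computation is linear in the per-sample terms.
assert abs(full - acc) < 1e-12
```

Because of this linearity, the single optimizer step after accumulation sees exactly the gradient it would have seen with the larger batch.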
Mixed Precision Training
Mixed precision training uses lower-precision floating-point formats to reduce memory and increase throughput:
- fp16 (float16): Halves memory per parameter compared to fp32. Requires loss scaling to handle the reduced dynamic range. Supported on all modern NVIDIA GPUs.
- bf16 (bfloat16): Provides the same dynamic range as fp32 with reduced precision. Does not require loss scaling. Supported on Ampere and later GPU architectures (A100, H100).
- Strategy: Model parameters and activations are stored and computed in the lower precision, while critical operations (loss computation, gradient accumulation) may use fp32 for numerical stability.
The Hugging Face Trainer integrates mixed precision via the fp16 and bf16 flags in TrainingArguments.
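A minimal configuration sketch of those flags (field values here are placeholders, not the FastChat defaults; consult the transformers documentation for your version):

```python
from transformers import TrainingArguments

# Config fragment only: pick exactly one of bf16/fp16.
args = TrainingArguments(
    output_dir="output_vicuna",  # placeholder path
    bf16=True,    # bfloat16 on Ampere+ (A100, H100); no loss scaling needed
    # fp16=True,  # on pre-Ampere GPUs instead of bf16; the Trainer applies
    #             # dynamic loss scaling automatically to handle fp16's
    #             # reduced dynamic range
)
```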
Checkpoint Resumption
Long training runs (potentially days or weeks for large models) require robust checkpoint resumption:
- The training state -- model parameters, optimizer state, learning rate scheduler state, and the current step/epoch -- is periodically saved to disk.
- If training is interrupted (hardware failure, preemption, timeout), it can be resumed from the most recent checkpoint without losing progress.
- The Vicuna training script implements automatic checkpoint detection: if any checkpoint-* directories exist in the output directory, training resumes from the latest one.
Learning Rate Schedules
The choice of learning rate schedule affects both convergence speed and final model quality:
- Warmup: The learning rate starts low and increases linearly over a warmup period, allowing the model to stabilize before receiving large gradient updates.
- Cosine decay: After warmup, the learning rate follows a cosine curve, gradually decreasing to near zero. This provides a smooth annealing effect.
- Constant with warmup: The learning rate increases during warmup and then remains constant for the rest of training.
The Hugging Face Trainer supports these schedules via the lr_scheduler_type argument.
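The warmup-then-cosine shape can be written in a few lines of pure Python. This mirrors the shape of the "cosine" schedule in transformers (the function name and exact endpoint behavior are illustrative):

```python
import math

def warmup_cosine_lr(step: int, peak_lr: float,
                     warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to ~0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine anneal

# Example curve: 100 warmup steps out of 1000 total, peak lr 2e-5.
lrs = [warmup_cosine_lr(s, 2e-5, 100, 1000) for s in range(1001)]
```

The constant-with-warmup variant simply replaces the cosine branch with `return peak_lr`.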
Optimizer Selection
The FastChat training configuration defaults to AdamW (adamw_torch), the standard optimizer for transformer fine-tuning:
- AdamW applies weight decay directly to parameters (decoupled from the adaptive learning rate), which provides better regularization than the original Adam with L2 penalty.
- The adamw_torch variant uses PyTorch's native AdamW implementation, which is compatible with FSDP's parameter sharding.
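The decoupling is visible in a single-parameter update step. The sketch below uses common default hyperparameters and is an illustration of the AdamW update rule, not PyTorch's implementation:

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter w with gradient g at step t."""
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * w is subtracted directly, NOT folded
    # into g, so it is not rescaled by the adaptive denominator. Folding
    # it into g (Adam + L2) would shrink the decay for parameters with
    # large gradient variance, weakening regularization.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w_new, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
```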
Usage
When launching distributed SFT training:
- Configure FSDP settings in the training arguments or via a separate FSDP config file.
- Set the per-device batch size, gradient accumulation steps, and learning rate.
- Choose the appropriate mixed precision mode (bf16 on Ampere+, fp16 otherwise).
- Launch the training script with torchrun or the Hugging Face Accelerate launcher across multiple GPUs/nodes.
- Monitor training via Weights & Biases (if configured) or TensorBoard.
- If training is interrupted, simply relaunch the same command; checkpoint resumption will handle the rest.
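A launch-command sketch tying the steps above together. Paths, batch sizes, and the model name are placeholders, not the official FastChat recipe; check the repository's documented training commands for exact flags:

```shell
# Single-node, 4-GPU launch (illustrative values throughout).
torchrun --nproc_per_node=4 fastchat/train/train.py \
    --model_name_or_path <base-model-path> \
    --data_path <sft-data.json> \
    --output_dir output_vicuna \
    --bf16 True \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --fsdp "full_shard auto_wrap"
```

With 4 GPUs, a per-device batch of 2, and 16 accumulation steps, the effective batch size is 2 * 4 * 16 = 128, per the formula in the Gradient Accumulation section.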
Theoretical Basis
Distributed training theory is rooted in data parallelism and communication-efficient optimization:
- Data parallelism processes different mini-batches on different devices, then synchronizes gradients. The mathematical equivalence between distributed and single-device training (with the same effective batch size) is guaranteed by the linearity of gradient computation.
- FSDP extends this with memory-efficient sharding, based on the observation that each parameter is only needed during specific phases of the forward and backward pass. By materializing parameters on demand and re-sharding them afterward, FSDP reduces the per-GPU memory footprint from O(model_size) to O(model_size / num_gpus) for parameters and optimizer states.
- Mixed precision exploits the fact that neural network training is tolerant of moderate numerical imprecision, allowing lower-precision arithmetic to achieve near-identical convergence with significant efficiency gains.
- Gradient accumulation is mathematically equivalent to using a larger batch size, under the assumption that micro-batch gradients are unbiased estimates of the full-batch gradient.
Related Pages
- Implemented by: Implementation:Lm_sys_FastChat_HF_Trainer_Train_FSDP
- Implemented by (variant): Implementation:Lm_sys_FastChat_Train_Baichuan — Baichuan-specific SFT with multiprocessing
- Implemented by (variant): Implementation:Lm_sys_FastChat_Train_Yuan2 — Yuan2-specific SFT with 3 loss modes
- Heuristic:Lm_sys_FastChat_Flash_Attention_GPU_Requirements