
Principle: lm-sys/FastChat Distributed SFT Training

From Leeroopedia


Page Type: Principle
Title: Distributed SFT Training
Repository: lm-sys/FastChat
Workflow: Vicuna SFT Finetuning
Domains: Distributed Training, FSDP, Mixed Precision, Gradient Accumulation
Knowledge Sources: fastchat/train/train.py, PyTorch FSDP documentation, Hugging Face Trainer documentation
Last Updated: 2026-02-07 14:00 GMT

Overview

This principle covers the theory and practice of distributed training for supervised fine-tuning (SFT) of large language models. It addresses Fully Sharded Data Parallel (FSDP) as the primary distribution strategy, along with gradient accumulation, mixed precision training, checkpoint resumption, and learning rate scheduling -- all of which are essential for training models at the scale of Vicuna (7B-33B parameters).

Description

Fully Sharded Data Parallel (FSDP)

FSDP is PyTorch's native implementation of the ZeRO (Zero Redundancy Optimizer) paradigm. Unlike standard Data Parallel (DP) or Distributed Data Parallel (DDP), which replicate the full model on each GPU, FSDP shards model parameters, gradients, and optimizer states across participating GPUs:

  • Parameter sharding: Each GPU holds only a fraction of the model's parameters. Parameters are gathered (all-gathered) on demand when needed for forward or backward passes, then re-sharded after use.
  • Gradient sharding: Gradients are reduced and scattered so that each GPU only stores gradients for its own parameter shard.
  • Optimizer state sharding: Optimizer states (e.g., Adam momentum and variance) are split across GPUs, dramatically reducing per-GPU memory.

This approach enables training of models that would not fit on a single GPU, while maintaining near-linear scaling efficiency.
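The scale of the savings can be estimated with simple arithmetic. A sketch in plain Python (the 7B parameter count, fp32 precision, and 8-GPU world size are illustrative assumptions, not values taken from the training script):

```python
# Rough per-GPU memory estimate for parameters, gradients, and optimizer
# states under full FSDP sharding. Activations are deliberately excluded.
def fsdp_state_memory_gb(num_params, num_gpus, bytes_per_param=4,
                         optimizer_states=2):
    # parameters + gradients + optimizer states (Adam: momentum + variance)
    total_bytes = num_params * bytes_per_param * (2 + optimizer_states)
    return total_bytes / num_gpus / 1e9

# Illustrative numbers: a 7B-parameter model on 1 GPU vs. 8 GPUs.
single = fsdp_state_memory_gb(7e9, num_gpus=1)   # ~112 GB: far beyond one GPU
sharded = fsdp_state_memory_gb(7e9, num_gpus=8)  # ~14 GB of state per GPU
```

The per-GPU figure shrinks linearly with the number of GPUs, which is the O(model_size / num_gpus) behavior discussed in the Theoretical Basis section; activation memory and temporary all-gather buffers come on top of this estimate.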

Gradient Accumulation

When the desired effective batch size exceeds what can fit in GPU memory (even with FSDP), gradient accumulation simulates a larger batch by:

  1. Running multiple forward-backward passes with smaller micro-batches.
  2. Accumulating gradients across these micro-batches without updating model parameters.
  3. Performing a single optimizer step after accumulating the desired number of micro-batches.

The effective batch size is: effective_batch = per_device_batch * num_devices * gradient_accumulation_steps.

Gradient accumulation is orthogonal to FSDP and can be combined with it to achieve very large effective batch sizes while keeping per-device memory usage manageable.
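The equivalence between accumulated micro-batches and one large batch can be verified on a toy problem (pure Python, no framework; the one-parameter linear model and the data points are invented for illustration):

```python
# Gradient of mean squared error for a one-parameter linear model y_hat = w*x:
# d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
def grad(w, xs, ys):
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Full-batch gradient computed in one pass.
full = grad(w, xs, ys)

# Two micro-batches of size 2: accumulate the mean gradients, then divide by
# the number of accumulation steps before the (single) optimizer step.
acc = grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])
acc /= 2
assert abs(full - acc) < 1e-12

# Effective batch size formula from above, for this toy setup:
effective_batch = 2 * 1 * 2  # per_device_batch * num_devices * accum_steps
```

With equally sized micro-batches, averaging the accumulated mean gradients reproduces the full-batch gradient exactly, up to floating-point rounding.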

Mixed Precision Training

Mixed precision training uses lower-precision floating-point formats to reduce memory and increase throughput:

  • fp16 (float16): Halves memory per parameter compared to fp32. Requires loss scaling to handle the reduced dynamic range. Supported on all modern NVIDIA GPUs.
  • bf16 (bfloat16): Provides the same dynamic range as fp32 with reduced precision. Does not require loss scaling. Supported on Ampere and later GPU architectures (A100, H100).
  • Strategy: Model parameters and activations are stored and computed in the lower precision, while critical operations (loss computation, gradient accumulation) may use fp32 for numerical stability.

The Hugging Face Trainer integrates mixed precision via the fp16 and bf16 flags in TrainingArguments.
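Why fp16 needs loss scaling can be demonstrated with Python's built-in half-precision packing (the struct format 'e'); the gradient magnitude 1e-8 and the scale factor 1024 are illustrative values chosen to straddle fp16's smallest subnormal (~6e-8):

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision (binary16).
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8      # a tiny gradient, below fp16's representable range
scale = 1024.0   # loss scale applied before the backward pass

assert to_fp16(grad) == 0.0          # unscaled, the gradient underflows to zero
scaled = to_fp16(grad * scale)       # scaled, it survives in fp16
assert scaled != 0.0
unscaled = scaled / scale            # unscale in fp32 before the optimizer step
assert abs(unscaled - grad) / grad < 0.05  # close to the true gradient
```

bf16 keeps fp32's exponent range, so values like 1e-8 remain representable and this scaling machinery is unnecessary, which is why bf16 is preferred on hardware that supports it.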

Checkpoint Resumption

Long training runs (potentially days or weeks for large models) require robust checkpoint resumption:

  • The training state -- model parameters, optimizer state, learning rate scheduler state, and the current step/epoch -- is periodically saved to disk.
  • If training is interrupted (hardware failure, preemption, timeout), it can be resumed from the most recent checkpoint without losing progress.
  • The Vicuna training script implements automatic checkpoint detection: if any checkpoint-* directories exist in the output directory, training resumes from the latest one.
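The detection step can be sketched as a small stand-alone function (a hand-rolled stand-in for illustration; the actual script relies on the Hugging Face Trainer's resume machinery, and the directory names below are created only for the demonstration):

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    # Find checkpoint-<step> subdirectories and return the one with the
    # highest step number, or None if no checkpoint exists yet.
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best_step, best_path = step, os.path.join(output_dir, name)
    return best_path

# Demonstration with throwaway directories.
with tempfile.TemporaryDirectory() as out:
    assert latest_checkpoint(out) is None
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(out, f"checkpoint-{step}"))
    assert latest_checkpoint(out).endswith("checkpoint-1500")
```

Note the numeric comparison on the step: a naive lexicographic sort would rank checkpoint-999 above checkpoint-1500 and resume from the wrong checkpoint.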

Learning Rate Schedules

The choice of learning rate schedule affects both convergence speed and final model quality:

  • Warmup: The learning rate starts low and increases linearly over a warmup period, allowing the model to stabilize before receiving large gradient updates.
  • Cosine decay: After warmup, the learning rate follows a cosine curve, gradually decreasing to near zero. This provides a smooth annealing effect.
  • Constant with warmup: The learning rate increases during warmup and then remains constant for the rest of training.

The Hugging Face Trainer supports these schedules via the lr_scheduler_type argument.
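The warmup-plus-cosine shape is easy to re-derive in a few lines (a minimal sketch; the step counts and peak learning rate are illustrative, though the Trainer's cosine scheduler follows the same curve):

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    # Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustrative schedule: 1000 steps, 3% warmup, peak LR 2e-5.
total, warmup, peak = 1000, 30, 2e-5
assert lr_at(0, total, warmup, peak) == 0.0       # starts at zero
assert lr_at(warmup, total, warmup, peak) == peak  # peaks at end of warmup
assert lr_at(total, total, warmup, peak) < 1e-10   # annealed to ~zero
```

Plotting lr_at over the full range shows the characteristic ramp followed by a smooth half-cosine; the constant-with-warmup variant simply replaces the cosine branch with peak_lr.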

Optimizer Selection

The FastChat training configuration defaults to AdamW (adamw_torch), the standard optimizer for transformer fine-tuning:

  • AdamW applies weight decay directly to parameters (decoupled from the adaptive learning rate), which provides better regularization than the original Adam with L2 penalty.
  • The adamw_torch variant uses PyTorch's native AdamW implementation, which is compatible with FSDP's parameter sharding.
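The decoupling can be seen in a single-scalar version of the update rule (a sketch of the standard AdamW step; the hyperparameter values are common transformer defaults chosen for illustration, not FastChat's exact configuration):

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    # One AdamW update for a single scalar parameter. The weight-decay term
    # is applied directly to w, outside the adaptive gradient scaling.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# With a zero gradient the adaptive term vanishes, yet the decoupled decay
# still shrinks the weight by lr * weight_decay * w -- independent of any
# gradient history, unlike an L2 penalty folded into the Adam gradient.
w, m, v = adamw_step(2.0, grad=0.0, m=0.0, v=0.0, t=1)
assert w < 2.0
```

Under the original Adam-with-L2 formulation, the penalty term would pass through the adaptive denominator and its effective strength would vary per parameter; the decoupled form keeps regularization uniform.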

Usage

When launching distributed SFT training:

  1. Configure FSDP settings in the training arguments or via a separate FSDP config file.
  2. Set the per-device batch size, gradient accumulation steps, and learning rate.
  3. Choose the appropriate mixed precision mode (bf16 on Ampere+, fp16 otherwise).
  4. Launch the training script with torchrun or the Hugging Face Accelerate launcher across multiple GPUs/nodes.
  5. Monitor training via Weights & Biases (if configured) or TensorBoard.
  6. If training is interrupted, simply relaunch the same command; checkpoint resumption will handle the rest.
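Steps 1-4 typically come together in a single launch command. A sketch of a single-node, 4-GPU torchrun invocation (a config fragment: the placeholder paths and flag values are illustrative, so consult the repository's README for the exact recommended command):

```shell
# Illustrative launch; <base-model> and <sft-data.json> are placeholders.
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train.py \
    --model_name_or_path <base-model> \
    --data_path <sft-data.json> \
    --output_dir ./checkpoints \
    --bf16 True \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer
```

Relaunching this exact command after an interruption triggers the checkpoint resumption path described above, since the checkpoint-* directories already exist in --output_dir.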

Theoretical Basis

Distributed training theory is rooted in data parallelism and communication-efficient optimization:

  • Data parallelism processes different mini-batches on different devices, then synchronizes gradients. The mathematical equivalence between distributed and single-device training (with the same effective batch size) is guaranteed by the linearity of gradient computation.
  • FSDP extends this with memory-efficient sharding, based on the observation that each parameter is only needed during specific phases of the forward and backward pass. By materializing parameters on demand and re-sharding them afterward, FSDP reduces the per-GPU memory footprint from O(model_size) to O(model_size / num_gpus) for parameters and optimizer states.
  • Mixed precision exploits the fact that neural network training is tolerant of moderate numerical imprecision, allowing lower-precision arithmetic to achieve near-identical convergence with significant efficiency gains.
  • Gradient accumulation is mathematically equivalent to using a larger batch size, under the assumption that micro-batch gradients are unbiased estimates of the full-batch gradient.
