
Principle:LMSYS FastChat DeepSpeed LoRA Training

From Leeroopedia


Knowledge Sources
Domains NLP, Training, Distributed Systems
Last Updated 2026-02-07 14:00 GMT

Overview

FastChat trains LoRA adapters on pretrained language models using DeepSpeed ZeRO optimization stages, providing distributed, memory-efficient fine-tuning through the HuggingFace Trainer integration.

Description

DeepSpeed is a deep learning optimization library developed by Microsoft that enables efficient distributed training through its ZeRO (Zero Redundancy Optimizer) technology. In FastChat's LoRA training pipeline, DeepSpeed integrates with the HuggingFace Trainer class to provide seamless multi-GPU training with optimizer state partitioning, gradient partitioning, and optional parameter partitioning.

The training process in FastChat's train_lora.py encompasses several key aspects:

  1. ZeRO Optimization Stages -- FastChat provides two DeepSpeed configuration files:
    • ZeRO Stage 2 (playground/deepspeed_config_s2.json): Partitions optimizer states and gradients across GPUs. Includes CPU offloading of optimizer states, contiguous gradient allocation, and communication overlap. This is the recommended configuration for LoRA and QLoRA training.
    • ZeRO Stage 3 (playground/deepspeed_config_s3.json): Additionally partitions model parameters across GPUs. Includes both optimizer and parameter CPU offloading, pinned memory, and automatic 16-bit weight gathering on model save. Note that ZeRO-3 is incompatible with QLoRA.
  2. HuggingFace Trainer Integration -- The Trainer class automatically detects the --deepspeed argument and initializes the DeepSpeed engine. The trainer handles distributed data loading, gradient accumulation, mixed precision, and checkpoint management.
  3. Training Data Pipeline -- The supervised data module (from make_supervised_data_module()) provides the tokenized conversation data. The Trainer passes this to DeepSpeed's distributed data sampler automatically.
  4. FlashAttention Support -- When --flash_attn True is set, the script applies replace_llama_attn_with_flash_attn() to monkey-patch LLaMA's attention implementation with FlashAttention for faster training throughput.
  5. Checkpoint Resume -- The training logic checks for existing checkpoint-* directories in the output directory. If found, training resumes from the latest checkpoint via trainer.train(resume_from_checkpoint=True). This enables fault-tolerant training across long runs.
  6. Model Cache Disabling -- model.config.use_cache = False is set before training to disable KV-cache, which is incompatible with gradient checkpointing and unnecessary during training.
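The six steps above can be outlined as a control-flow sketch. This is a stdlib-only illustration, not FastChat's actual code: the transformers/PEFT/DeepSpeed calls appear as comments, and the helper name `should_resume` is hypothetical.

```python
from pathlib import Path


def should_resume(output_dir: str) -> bool:
    """Mirror the resume check: resume iff any checkpoint-* directory exists."""
    return any(Path(output_dir).glob("checkpoint-*"))


def train(output_dir: str, flash_attn: bool = False) -> None:
    """Sketch of the train_lora.py flow; heavy calls are shown as comments."""
    if flash_attn:
        # replace_llama_attn_with_flash_attn()  # monkey-patch before model load
        pass
    # model = AutoModelForCausalLM.from_pretrained(...)   # then wrap with LoRA (peft)
    # model.config.use_cache = False                      # KV-cache clashes with grad checkpointing
    # trainer = Trainer(model=model, args=TrainingArguments(
    #     output_dir=output_dir,
    #     deepspeed="playground/deepspeed_config_s2.json",  # ZeRO-2 for LoRA/QLoRA
    # ), ...)
    if should_resume(output_dir):
        pass  # trainer.train(resume_from_checkpoint=True)
    else:
        pass  # trainer.train()
```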

Usage

Use this pattern when:

  • Training LoRA or QLoRA adapters across multiple GPUs using DeepSpeed for memory efficiency.
  • The model and optimizer states would not fit on a single GPU without ZeRO partitioning.
  • Resuming training from a previously saved checkpoint.
  • Using the HuggingFace Trainer API with DeepSpeed backend.

Do not use this pattern when:

  • Training fits on a single GPU without memory pressure (standard Trainer suffices).
  • Using QLoRA with ZeRO Stage 3 (these are incompatible).
  • FSDP is the preferred distributed strategy (use a separate training script).

Theoretical Basis

ZeRO Optimization: In standard data-parallel training, each GPU maintains a full copy of model parameters, gradients, and optimizer states. For a model with N parameters using Adam optimizer, each GPU requires:

Memory per GPU = 2N (params, FP16) + 2N (gradients, FP16) + 12N (optimizer: FP32 params + FP32 momentum + FP32 variance)
             = 16N bytes
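As a concrete instance of the 16N figure, the per-GPU footprint of a 7B-parameter model under this accounting works out to roughly 104 GiB before any partitioning:

```python
N = 7_000_000_000          # parameter count
params_fp16 = 2 * N        # FP16 parameters
grads_fp16 = 2 * N         # FP16 gradients
optim_fp32 = 12 * N        # FP32 master params + momentum + variance (4N each)

total = params_fp16 + grads_fp16 + optim_fp32
assert total == 16 * N
print(total / 2**30)       # ≈ 104 GiB per GPU, far beyond a single accelerator
```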

ZeRO eliminates this redundancy by partitioning across P GPUs:

Stage    Partitioned         Memory per GPU       Communication overhead
ZeRO-1   Optimizer states    4N + 12N/P           Same as DDP
ZeRO-2   + Gradients         2N + (2N + 12N)/P    Same as DDP
ZeRO-3   + Parameters        16N/P                1.5x DDP
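The table's per-stage formulas can be evaluated directly. A minimal sketch (the function name is illustrative) for a 7B model on 8 GPUs:

```python
def zero_memory_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Bytes per GPU under the accounting above (FP16 params/grads, Adam)."""
    N, P = n_params, n_gpus
    if stage == 0:                      # plain data parallelism (DDP)
        return 16 * N
    if stage == 1:                      # partition optimizer states
        return 4 * N + 12 * N / P
    if stage == 2:                      # + partition gradients
        return 2 * N + (2 * N + 12 * N) / P
    if stage == 3:                      # + partition parameters
        return 16 * N / P
    raise ValueError(f"unknown ZeRO stage: {stage}")


# 7B parameters across 8 GPUs, reported in GiB per GPU:
for stage in (0, 1, 2, 3):
    print(stage, round(zero_memory_per_gpu(7e9, 8, stage) / 2**30, 1))
```

ZeRO-3's 16N/P term is why it can fit models that no single GPU could hold, at the cost of roughly 1.5x DDP communication.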

ZeRO Stage 2 Configuration (FastChat): The deepspeed_config_s2.json enables:

- Optimizer state partitioning across GPUs
- CPU offloading of optimizer states (reduces GPU memory further)
- Contiguous gradient storage (improved memory locality)
- Communication overlap (hides allreduce latency behind compute)
- FP16 mixed precision (enabled via "auto" to follow Trainer settings)
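A configuration with these options can be sketched as a Python dict (HuggingFace TrainingArguments accepts either a JSON path or a dict for its deepspeed parameter). The keys below are standard DeepSpeed config fields, but this is an illustration of the options listed above, not a verbatim copy of FastChat's playground/deepspeed_config_s2.json:

```python
# ZeRO Stage 2 sketch: optimizer/gradient partitioning with CPU offload.
zero2_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # push optimizer states to host RAM
        "contiguous_gradients": True,            # improved memory locality
        "overlap_comm": True,                    # hide allreduce behind compute
    },
    "fp16": {"enabled": "auto"},                 # "auto": follow the Trainer's --fp16 flag
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```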

ZeRO Stage 3 Configuration (FastChat): The deepspeed_config_s3.json additionally enables:

- Parameter partitioning across GPUs
- CPU offloading of both optimizer states and parameters (with pinned memory)
- Stage 3 prefetch bucket (500M params) for overlapping communication
- 16-bit weight gathering on model save (stage3_gather_16bit_weights_on_model_save: true)
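The Stage 3 additions look like the following sketch, again using standard DeepSpeed field names rather than FastChat's exact file contents (the prefetch bucket size is specified in elements, here 5e8 for the 500M figure):

```python
# ZeRO Stage 3 sketch: adds parameter partitioning and parameter offload.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_prefetch_bucket_size": 5e8,      # prefetch partitioned params ahead of use
        "stage3_gather_16bit_weights_on_model_save": True,  # reassemble weights on save
    },
    "fp16": {"enabled": "auto"},
}
```

Because parameters themselves are partitioned and offloaded, this configuration conflicts with QLoRA's quantized weight storage, which is why FastChat recommends Stage 2 for QLoRA runs.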

Checkpoint Resume: DeepSpeed checkpoints contain the model state, optimizer state, scheduler state, and random number generator states. When resume_from_checkpoint=True is passed, the Trainer:

  1. Locates the latest checkpoint-* directory
  2. Loads the DeepSpeed engine state (including ZeRO partitioned states)
  3. Resumes from the exact training step, preserving learning rate schedule and optimizer momentum

