Principle:Lm sys FastChat DeepSpeed LoRA Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Distributed Systems |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle covers training LoRA adapters on pretrained language models with DeepSpeed ZeRO optimization stages for distributed, memory-efficient fine-tuning through the HuggingFace Trainer integration.
Description
DeepSpeed is a deep learning optimization library developed by Microsoft that enables efficient distributed training through its ZeRO (Zero Redundancy Optimizer) technology. In FastChat's LoRA training pipeline, DeepSpeed integrates with the HuggingFace Trainer class to provide seamless multi-GPU training with optimizer state partitioning, gradient partitioning, and optional parameter partitioning.
The training process in FastChat's train_lora.py encompasses several key aspects:
- ZeRO Optimization Stages -- FastChat provides two DeepSpeed configuration files:
  - ZeRO Stage 2 (`playground/deepspeed_config_s2.json`): partitions optimizer states and gradients across GPUs. Includes CPU offloading of optimizer states, contiguous gradient allocation, and communication overlap. This is the recommended configuration for LoRA and QLoRA training.
  - ZeRO Stage 3 (`playground/deepspeed_config_s3.json`): additionally partitions model parameters across GPUs. Includes both optimizer and parameter CPU offloading, pinned memory, and automatic 16-bit weight gathering on model save. Note that ZeRO-3 is incompatible with QLoRA.
- HuggingFace Trainer Integration -- The `Trainer` class automatically detects the `--deepspeed` argument and initializes the DeepSpeed engine. The trainer handles distributed data loading, gradient accumulation, mixed precision, and checkpoint management.
- Training Data Pipeline -- The supervised data module (built by `make_supervised_data_module()`) provides the tokenized conversation data. The Trainer passes this to DeepSpeed's distributed data sampler automatically.
- FlashAttention Support -- When `--flash_attn True` is set, the script applies `replace_llama_attn_with_flash_attn()` to monkey-patch LLaMA's attention implementation with FlashAttention for faster training throughput.
- Checkpoint Resume -- The training logic checks for existing `checkpoint-*` directories in the output directory. If found, training resumes from the latest checkpoint via `trainer.train(resume_from_checkpoint=True)`. This enables fault-tolerant training across long runs.
- Model Cache Disabling -- `model.config.use_cache = False` is set before training to disable the KV cache, which is incompatible with gradient checkpointing and unnecessary during training.
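A typical multi-GPU launch combining these pieces might look as follows. The `--deepspeed` and `--flash_attn` flags are described above; the model, data, and output paths are illustrative placeholders, and the remaining flags should be checked against `train_lora.py`'s actual argument list:

```shell
# Launch LoRA training on all local GPUs with the ZeRO-2 config.
# Paths and model name below are illustrative, not prescriptive.
deepspeed fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --data_path data/train.json \
    --output_dir ./checkpoints/lora-run \
    --flash_attn True \
    --deepspeed playground/deepspeed_config_s2.json
```

Swapping in `playground/deepspeed_config_s3.json` selects ZeRO Stage 3 instead (not valid for QLoRA, per the note above).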
Usage
Use this pattern when:
- Training LoRA or QLoRA adapters across multiple GPUs using DeepSpeed for memory efficiency.
- The model and optimizer states would not fit on a single GPU without ZeRO partitioning.
- Resuming training from a previously saved checkpoint.
- Using the HuggingFace Trainer API with DeepSpeed backend.
Do not use this pattern when:
- Training fits on a single GPU without memory pressure (standard Trainer suffices).
- Using QLoRA with ZeRO Stage 3 (these are incompatible).
- FSDP is the preferred distributed strategy (use a separate training script).
Theoretical Basis
ZeRO Optimization: In standard data-parallel training, each GPU maintains a full copy of model parameters, gradients, and optimizer states. For a model with N parameters using Adam optimizer, each GPU requires:
Memory per GPU = 2N (params, FP16) + 2N (gradients, FP16) + 12N (optimizer: FP32 params + FP32 momentum + FP32 variance)
= 16N bytes
ZeRO eliminates this redundancy by partitioning across P GPUs:
| Stage | Partitioned | Memory per GPU | Communication Overhead |
|---|---|---|---|
| ZeRO-1 | Optimizer states | 4N + 12N/P | Same as DDP |
| ZeRO-2 | + Gradients | 2N + (2N + 12N)/P | Same as DDP |
| ZeRO-3 | + Parameters | 16N/P | 1.5x DDP |
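Plugging concrete numbers into these formulas makes the savings tangible. The sketch below applies the table's per-GPU memory expressions to a 7B-parameter model on 8 GPUs (assuming 1 GB = 1e9 bytes; the function name is illustrative):

```python
# Per-GPU memory from the table's formulas, with N parameters and P GPUs.
# Assumes FP16 params (2N) + FP16 gradients (2N) + Adam states (12N) bytes.
def zero_memory_gb(n_params: float, n_gpus: int) -> dict:
    N, P = n_params, n_gpus
    return {
        "ddp":   16 * N / 1e9,                # full replication on every GPU
        "zero1": (4 * N + 12 * N / P) / 1e9,  # optimizer states sharded
        "zero2": (2 * N + 14 * N / P) / 1e9,  # + gradients sharded
        "zero3": 16 * N / P / 1e9,            # + parameters sharded
    }

# Example: a 7B-parameter model on 8 GPUs
mem = zero_memory_gb(7e9, 8)
print(mem)  # ddp: 112.0, zero1: 38.5, zero2: 26.25, zero3: 14.0
```

So a model that needs 112 GB per GPU under plain DDP fits in about 26 GB per GPU under ZeRO-2, before CPU offloading shrinks it further.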
ZeRO Stage 2 Configuration (FastChat): The deepspeed_config_s2.json enables:
- Optimizer state partitioning across GPUs
- CPU offloading of optimizer states (reduces GPU memory further)
- Contiguous gradient storage (improved memory locality)
- Communication overlap (hides allreduce latency behind compute)
- FP16 mixed precision (enabled via "auto" to follow Trainer settings)
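A minimal fragment illustrating the Stage 2 features above, using DeepSpeed's documented config keys ("auto" defers to the Trainer's settings; the exact values in FastChat's `playground/deepspeed_config_s2.json` may differ):

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```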
ZeRO Stage 3 Configuration (FastChat): The deepspeed_config_s3.json additionally enables:
- Parameter partitioning across GPUs
- CPU offloading of both optimizer states and parameters (with pinned memory)
- Stage 3 prefetch bucket (500M params) for overlapping communication
- 16-bit weight gathering on model save (stage3_gather_16bit_weights_on_model_save: true)
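The Stage 3 additions map onto the config schema like this (again an illustrative fragment with DeepSpeed's documented keys, not the verbatim contents of `playground/deepspeed_config_s3.json`):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": { "enabled": "auto" }
}
```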
Checkpoint Resume: DeepSpeed checkpoints contain the model state, optimizer state, scheduler state, and random number generator states. When resume_from_checkpoint=True is passed, the Trainer:
- Locates the latest `checkpoint-*` directory
- Loads the DeepSpeed engine state (including ZeRO partitioned states)
- Resumes from the exact training step, preserving learning rate schedule and optimizer momentum
Related Pages
Implemented By
- Implementation:Lm_sys_FastChat_HF_Trainer_Train_DeepSpeed
- Implementation:Lm_sys_FastChat_Train_LoRA_T5 (variant: T5 LoRA/QLoRA training with DeepSpeed)