Environment: LMSYS FastChat LoRA/QLoRA Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Optimization |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
DeepSpeed-based training environment with PEFT, BitsAndBytes 4-bit quantization (QLoRA), and optional Flash Attention for parameter-efficient fine-tuning.
Description
This environment provides the LoRA and QLoRA fine-tuning stack. It uses DeepSpeed ZeRO Stage 2 or 3 for distributed training with CPU optimizer offloading, PEFT for LoRA adapter injection, and optionally BitsAndBytes for 4-bit NF4 quantization (QLoRA). The default LoRA configuration targets `q_proj` and `v_proj` layers with rank 8 and alpha 16. Flash Attention is available as an opt-in flag.
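To make the default configuration concrete, here is a back-of-the-envelope count of how few parameters the default LoRA setup trains. The dimensions (32 layers, hidden size 4096, ~6.7B base parameters) are hypothetical LLaMA-7B-like values, not taken from this repository; only the rank and target modules come from the defaults above.

```python
# Rough LoRA trainable-parameter count, assuming hypothetical
# LLaMA-7B-like dimensions: 32 layers, hidden size 4096.
hidden = 4096
layers = 32
rank = 8               # default lora_r
targets_per_layer = 2  # q_proj and v_proj

# Each adapted d x d projection adds A (r x d) and B (d x r) matrices.
params_per_module = rank * hidden * 2
lora_params = params_per_module * targets_per_layer * layers
base_params = 6.7e9    # approximate 7B base model size

print(f"trainable LoRA params: {lora_params:,}")  # 4,194,304
print(f"fraction of base model: {lora_params / base_params:.4%}")
```

Under these assumptions, LoRA trains roughly 0.06% of the base model's parameters, which is what makes single-GPU fine-tuning feasible.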
Usage
Use this environment for parameter-efficient fine-tuning via LoRA or QLoRA; it is required to run `train_lora.py`. Use DeepSpeed ZeRO-2 for QLoRA (ZeRO-3 is incompatible with quantized parameters) and ZeRO-3 for full LoRA when parameter offloading is needed.
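A launch might look like the following sketch. The flag names mirror the dataclass fields shown in Code Evidence below plus standard HuggingFace `TrainingArguments`; the model name, data path, and output directory are placeholders, so verify everything against your checkout of `train_lora.py`.

```shell
# Hypothetical QLoRA launch with ZeRO-2; paths and model are placeholders.
deepspeed fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/dummy_conversation.json \
    --output_dir ./checkpoints-qlora \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --bf16 True
```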
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | DeepSpeed and BitsAndBytes require Linux |
| Hardware | 1+ NVIDIA GPU with 16GB+ VRAM | QLoRA fits 7B models on single 16GB GPU |
| CUDA | 11.8+ | BitsAndBytes requires CUDA |
| Disk | 100GB+ SSD | Model weights + LoRA checkpoints |
Dependencies
Python Packages
- `torch` — PyTorch with CUDA
- `transformers` >= 4.31.0 — Model loading, Trainer, BitsAndBytesConfig
- `deepspeed` — Distributed training (hard dependency, unguarded import)
- `peft` — LoRA adapter injection (prepare_model_for_kbit_training, get_peft_model)
- `bitsandbytes` — 4-bit NF4 quantization for QLoRA (via `transformers.BitsAndBytesConfig`)
- `flash-attn` >= 2.0 — Optional, enabled via `--flash_attn True`
DeepSpeed Configurations
ZeRO Stage 2 (`playground/deepspeed_config_s2.json`) — Use with QLoRA:
- CPU optimizer offload, contiguous gradients, communication overlap
- Does NOT offload or partition parameters; since parameter sharding is what breaks QLoRA, Stage 2 is the compatible choice
ZeRO Stage 3 (`playground/deepspeed_config_s3.json`) — Use with full LoRA:
- CPU optimizer + parameter offload with pinned memory
- Gathers 16-bit weights on model save
- Advanced tuning: `stage3_max_live_parameters=1e9`, `stage3_prefetch_bucket_size=5e8`
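For orientation, a ZeRO-2 config with the features described above might look like this illustrative fragment. It is not the contents of `playground/deepspeed_config_s2.json`, only a sketch using standard DeepSpeed keys; consult the actual file in the repository.

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```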
Credentials
- `WORLD_SIZE`: Set automatically by DeepSpeed launcher for distributed training
- `LOCAL_RANK`: Set automatically; used for QLoRA device mapping in DDP mode
- `WANDB_API_KEY`: Optional, for experiment tracking
Quick Install
# Install LoRA training dependencies
pip install "fschat[model_worker]" deepspeed
# For QLoRA (4-bit quantization)
pip install bitsandbytes
# For Flash Attention support
pip install flash-attn --no-build-isolation
Code Evidence
DeepSpeed hard dependency from `fastchat/train/train_lora.py:24-25`:
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
FSDP/ZeRO-3 incompatibility warning from `fastchat/train/train_lora.py:123-126`:
if lora_args.q_lora:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
    if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
        logging.warning("FSDP and ZeRO3 are both currently incompatible with QLoRA.")
QLoRA 4-bit NF4 configuration from `fastchat/train/train_lora.py:138-143`:
quantization_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)
Default LoRA hyperparameters from `fastchat/train/train_lora.py:56-65`:
class LoraArguments:
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_target_modules: typing.List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj"]
    )
    lora_bias: str = "none"
    q_lora: bool = False
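The hyperparameters above enter the forward pass as `y = Wx + (alpha/r) * B(Ax)`, where `W` is frozen and only `A` and `B` are trained. The following pure-Python toy (tiny sizes, identity `W`, dropout ignored; none of it taken from the repository) shows the arithmetic and why a zero-initialized `B` makes the adapter a no-op at the start of training:

```python
# Toy numeric sketch of the LoRA update y = Wx + (alpha/r) * B(Ax).
def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

d, r = 4, 2        # toy sizes; the defaults above use r=8, alpha=16
alpha = 16
scale = alpha / r  # mirrors lora_alpha / lora_r

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.1] * d for _ in range(r)]  # r x d, trained
B = [[0.0] * r for _ in range(d)]  # d x r, conventionally zero-initialized

x = [1.0, 2.0, 3.0, 4.0]
y = [w + scale * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# With B all zeros, the adapter contributes nothing yet: y equals Wx.
print(y)  # [1.0, 2.0, 3.0, 4.0]
```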
Flash Attention dtype casting for LoRA from `fastchat/train/train_lora.py:166-172`:
if training_args.flash_attn:
    for name, module in model.named_modules():
        if "norm" in name:
            module = module.to(compute_dtype)
        if "lm_head" in name or "embed_tokens" in name:
            if hasattr(module, "weight"):
                module = module.to(compute_dtype)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FSDP and ZeRO3 are both currently incompatible with QLoRA` | Using QLoRA with ZeRO-3 | Use ZeRO Stage 2 config (`deepspeed_config_s2.json`) for QLoRA |
| `ImportError: No module named 'deepspeed'` | DeepSpeed not installed | `pip install deepspeed` |
| `CUDA out of memory` during QLoRA | Batch size too large | Set `--per_device_train_batch_size 1` with `--gradient_accumulation_steps 16` |
| Mixed precision errors with Flash Attention + LoRA | Norm and embedding layers in wrong dtype | The code auto-casts `norm`, `lm_head`, and `embed_tokens` layers when `--flash_attn True` |
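The OOM workaround trades batch size for accumulation steps while keeping the effective global batch size constant. A quick sanity check of that arithmetic (the 4-GPU world size is a hypothetical example):

```python
# Effective global batch size for the suggested OOM workaround.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
world_size = 4  # hypothetical 4-GPU run; adjust to your setup

effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(effective_batch)  # 64
```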
Compatibility Notes
- QLoRA + ZeRO: QLoRA is only compatible with DeepSpeed ZeRO Stage 2. ZeRO Stage 3 does not work because it attempts to shard quantized parameters.
- Multi-GPU QLoRA without DDP: When running QLoRA on multiple GPUs without DDP (`WORLD_SIZE=1`), the code sets `model.is_parallelizable = True` and `model.model_parallel = True` to bypass Trainer's DataParallel.
- Gradient Checkpointing: When enabled, `model.enable_input_require_grads()` must be called to allow gradients through the input embeddings (required for PEFT).
- ZeRO-3 Model Saving: Uses `trainer.model_wrapped._zero3_consolidated_16bit_state_dict()` to gather the full model, then PEFT's `save_pretrained` extracts only LoRA weights.