Environment: LMSYS FastChat LoRA/QLoRA Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Optimization |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
DeepSpeed-based training environment with PEFT, BitsAndBytes 4-bit quantization (QLoRA), and optional Flash Attention for parameter-efficient fine-tuning.
Description
This environment provides the LoRA and QLoRA fine-tuning stack. It uses DeepSpeed ZeRO Stage 2 or 3 for distributed training with CPU optimizer offloading, PEFT for LoRA adapter injection, and optionally BitsAndBytes for 4-bit NF4 quantization (QLoRA). The default LoRA configuration targets `q_proj` and `v_proj` layers with rank 8 and alpha 16. Flash Attention is available as an opt-in flag.
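To make the default configuration concrete, here is a back-of-the-envelope count of how few parameters the default LoRA setup trains. The dimensions (32 layers, hidden size 4096, ~6.7B base parameters) are hypothetical LLaMA-7B-like values, not taken from this repository; only the rank and target modules come from the defaults above.

```python
# Rough LoRA trainable-parameter count, assuming hypothetical
# LLaMA-7B-like dimensions: 32 layers, hidden size 4096.
hidden = 4096
layers = 32
rank = 8               # default lora_r
targets_per_layer = 2  # q_proj and v_proj

# Each adapted d x d projection adds A (r x d) and B (d x r) matrices.
params_per_module = rank * hidden * 2
lora_params = params_per_module * targets_per_layer * layers
base_params = 6.7e9    # approximate 7B base model size

print(f"trainable LoRA params: {lora_params:,}")  # 4,194,304
print(f"fraction of base model: {lora_params / base_params:.4%}")
```

Under these assumptions, LoRA trains roughly 0.06% of the base model's parameters, which is what makes single-GPU fine-tuning feasible.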
Usage
Use this environment for parameter-efficient fine-tuning via LoRA or QLoRA; it is required to run `train_lora.py`. Use DeepSpeed ZeRO-2 for QLoRA (ZeRO-3 is incompatible with quantized parameters) and ZeRO-3 for full LoRA when parameter offloading is needed.
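A launch might look like the following sketch. The flag names mirror the dataclass fields shown in Code Evidence below plus standard HuggingFace `TrainingArguments`; the model name, data path, and output directory are placeholders, so verify everything against your checkout of `train_lora.py`.

```shell
# Hypothetical QLoRA launch with ZeRO-2; paths and model are placeholders.
deepspeed fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/dummy_conversation.json \
    --output_dir ./checkpoints-qlora \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --bf16 True
```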
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | DeepSpeed and BitsAndBytes require Linux |
| Hardware | 1+ NVIDIA GPU with 16GB+ VRAM | QLoRA fits 7B models on single 16GB GPU |
| CUDA | 11.8+ | BitsAndBytes requires CUDA |
| Disk | 100GB+ SSD | Model weights + LoRA checkpoints |
Dependencies
Python Packages
- `torch` — PyTorch with CUDA
- `transformers` >= 4.31.0 — Model loading, Trainer, BitsAndBytesConfig
- `deepspeed` — Distributed training (hard dependency, unguarded import)
- `peft` — LoRA adapter injection (prepare_model_for_kbit_training, get_peft_model)
- `bitsandbytes` — 4-bit NF4 quantization for QLoRA (via `transformers.BitsAndBytesConfig`)
- `flash-attn` >= 2.0 — Optional, enabled via `--flash_attn True`
DeepSpeed Configurations
ZeRO Stage 2 (`playground/deepspeed_config_s2.json`) — Use with QLoRA:
- CPU optimizer offload, contiguous gradients, communication overlap
- Does NOT offload or partition parameters; since parameter sharding is what breaks QLoRA, Stage 2 is the compatible choice
ZeRO Stage 3 (`playground/deepspeed_config_s3.json`) — Use with full LoRA:
- CPU optimizer + parameter offload with pinned memory
- Gathers 16-bit weights on model save
- Advanced tuning: `stage3_max_live_parameters=1e9`, `stage3_prefetch_bucket_size=5e8`
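For orientation, a ZeRO-2 config with the features described above might look like this illustrative fragment. It is not the contents of `playground/deepspeed_config_s2.json`, only a sketch using standard DeepSpeed keys; consult the actual file in the repository.

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```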
Credentials
- `WORLD_SIZE`: Set automatically by DeepSpeed launcher for distributed training
- `LOCAL_RANK`: Set automatically; used for QLoRA device mapping in DDP mode
- `WANDB_API_KEY`: Optional, for experiment tracking
Quick Install
# Install LoRA training dependencies
pip install "fschat[model_worker]" deepspeed
# For QLoRA (4-bit quantization)
pip install bitsandbytes
# For Flash Attention support
pip install flash-attn --no-build-isolation
Code Evidence
DeepSpeed hard dependency from `fastchat/train/train_lora.py:24-25`:
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
FSDP/ZeRO-3 incompatibility warning from `fastchat/train/train_lora.py:123-126`:
if lora_args.q_lora:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
    if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
        logging.warning("FSDP and ZeRO3 are both currently incompatible with QLoRA.")
QLoRA 4-bit NF4 configuration from `fastchat/train/train_lora.py:138-143`:
quantization_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)
Default LoRA hyperparameters from `fastchat/train/train_lora.py:56-65`:
class LoraArguments:
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_target_modules: typing.List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj"]
    )
    lora_bias: str = "none"
    q_lora: bool = False
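The hyperparameters above enter the forward pass as `y = Wx + (alpha/r) * B(Ax)`, where `W` is frozen and only `A` and `B` are trained. The following pure-Python toy (tiny sizes, identity `W`, dropout ignored; none of it taken from the repository) shows the arithmetic and why a zero-initialized `B` makes the adapter a no-op at the start of training:

```python
# Toy numeric sketch of the LoRA update y = Wx + (alpha/r) * B(Ax).
def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

d, r = 4, 2        # toy sizes; the defaults above use r=8, alpha=16
alpha = 16
scale = alpha / r  # mirrors lora_alpha / lora_r

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.1] * d for _ in range(r)]  # r x d, trained
B = [[0.0] * r for _ in range(d)]  # d x r, conventionally zero-initialized

x = [1.0, 2.0, 3.0, 4.0]
y = [w + scale * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# With B all zeros, the adapter contributes nothing yet: y equals Wx.
print(y)  # [1.0, 2.0, 3.0, 4.0]
```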
Flash Attention dtype casting for LoRA from `fastchat/train/train_lora.py:166-172`:
if training_args.flash_attn:
    for name, module in model.named_modules():
        if "norm" in name:
            module = module.to(compute_dtype)
        if "lm_head" in name or "embed_tokens" in name:
            if hasattr(module, "weight"):
                module = module.to(compute_dtype)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FSDP and ZeRO3 are both currently incompatible with QLoRA` | Using QLoRA with ZeRO-3 | Use ZeRO Stage 2 config (`deepspeed_config_s2.json`) for QLoRA |
| `ImportError: No module named 'deepspeed'` | DeepSpeed not installed | `pip install deepspeed` |
| `CUDA out of memory` during QLoRA | Batch size too large | Set `--per_device_train_batch_size 1` with `--gradient_accumulation_steps 16` |
| Mixed precision errors with Flash Attention + LoRA | Norm and embedding layers in wrong dtype | The code auto-casts `norm`, `lm_head`, and `embed_tokens` layers when `--flash_attn True` |
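The OOM workaround trades batch size for accumulation steps while keeping the effective global batch size constant. A quick sanity check of that arithmetic (the 4-GPU world size is a hypothetical example):

```python
# Effective global batch size for the suggested OOM workaround.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
world_size = 4  # hypothetical 4-GPU run; adjust to your setup

effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(effective_batch)  # 64
```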
Compatibility Notes
- QLoRA + ZeRO: QLoRA is only compatible with DeepSpeed ZeRO Stage 2. ZeRO Stage 3 does not work because it attempts to shard quantized parameters.
- Multi-GPU QLoRA without DDP: When running QLoRA on multiple GPUs without DDP (`WORLD_SIZE=1`), the code sets `model.is_parallelizable = True` and `model.model_parallel = True` to bypass Trainer's DataParallel.
- Gradient Checkpointing: When enabled, `model.enable_input_require_grads()` must be called to allow gradients through the input embeddings (required for PEFT).
- ZeRO-3 Model Saving: Uses `trainer.model_wrapped._zero3_consolidated_16bit_state_dict()` to gather the full model, then PEFT's `save_pretrained` extracts only LoRA weights.