Principle: LMSYS FastChat Quantized Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Quantization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Loading pretrained causal language models with 4-bit NF4 quantization via BitsAndBytes, enabling QLoRA fine-tuning at dramatically reduced memory cost while preserving model quality.
Description
QLoRA (Quantized Low-Rank Adaptation) is a memory-efficient fine-tuning technique introduced by Dettmers et al. (2023). The core idea is to load the base model weights in a 4-bit quantized format, freeze them entirely, and attach small trainable LoRA adapter layers in higher precision. This allows fine-tuning of models that would otherwise exceed available GPU memory.
The quantized model loading step in FastChat's `train_lora.py` addresses several critical concerns:
- 4-bit NF4 Quantization -- The NormalFloat 4-bit (NF4) data type is an information-theoretically optimal quantization format for normally distributed weights. Unlike uniform quantization, NF4 allocates more representation levels near the center of the weight distribution where density is highest, minimizing quantization error.
- Double Quantization -- The quantization constants (scale factors) themselves consume memory. Double quantization applies a second round of quantization to these constants, saving approximately 0.4 bits per parameter (roughly 3 GB for a 65B model) with negligible impact on quality.
- Compute Dtype Selection -- While weights are stored in 4-bit format, computation during the forward and backward passes occurs in a higher-precision compute dtype. FastChat selects this dynamically: `float16` when the `--fp16` flag is set, `bfloat16` when `--bf16` is set, or `float32` as a fallback. BFloat16 is preferred on Ampere+ GPUs for its larger dynamic range.
- Device Mapping for DDP -- When using QLoRA with Distributed Data Parallel (DDP), each process must map the model to its own local GPU. FastChat sets `device_map={"": LOCAL_RANK}` to route the entire model to the correct device for each process.
- Incompatibility Guards -- FSDP (Fully Sharded Data Parallel) and DeepSpeed ZeRO Stage 3 are both currently incompatible with QLoRA because they attempt to shard parameters across devices, which conflicts with the device-local quantized weight storage. FastChat logs a warning when these configurations are detected.
Usage
Use this pattern when:
- Fine-tuning large language models (7B+ parameters) on GPUs with limited VRAM (e.g., single 24GB GPU).
- Running QLoRA training where the base model is loaded in 4-bit precision and only LoRA adapters are trained.
- The training script is invoked with `--q_lora True` to activate quantized loading.
- You need to balance memory savings against the compute overhead of dequantization.
Do not use this pattern when:
- Using FSDP or DeepSpeed ZeRO Stage 3, which are incompatible with QLoRA.
- Full-precision LoRA training is desired (set `--q_lora False` instead).
- The model fits comfortably in GPU memory without quantization.
Theoretical Basis
NF4 Quantization: Given a pretrained weight tensor W with approximately normal distribution, NF4 maps each weight to one of 16 levels optimally spaced for the Gaussian distribution:
W_quantized = quantize_nf4(W)
The 16 quantization levels are computed as the quantiles of the standard normal distribution N(0, 1), ensuring each bin captures approximately equal probability mass. This yields information-theoretically optimal quantization for normally distributed data.
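The equal-probability-mass construction can be sketched with the standard library's `statistics.NormalDist`. Note this is illustrative only: the real NF4 table in bitsandbytes uses a slightly different asymmetric construction so that zero is exactly representable.

```python
from statistics import NormalDist


def gaussian_quantile_levels(k: int = 16) -> list[float]:
    """Place k levels so each bin of N(0, 1) holds equal probability mass.

    Each level sits at the probability midpoint of its bin; the result is
    normalized so the outermost levels land at +/-1, mirroring how NF4
    levels are scaled by a per-block absmax constant.
    """
    nd = NormalDist()
    levels = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    m = max(abs(levels[0]), abs(levels[-1]))
    return [v / m for v in levels]


levels = gaussian_quantile_levels()
```

Because the levels are quantiles, they cluster near zero where the weight density is highest -- exactly the property the bullet above describes.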
Dequantization During Compute: During the forward pass, quantized weights are dequantized on-the-fly at the chosen compute precision:
h = dequantize(W_quantized).to(compute_dtype) @ x
The dequantized values are never stored persistently -- they are computed per-operation, keeping the memory footprint at the 4-bit level.
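A toy blockwise version of this store-quantized / dequantize-per-op pattern can be written in a few lines. This is purely illustrative: it uses uniformly spaced levels rather than the true NF4 table, and bears no resemblance to the fused bitsandbytes kernels.

```python
import torch

# Stand-in level table; real NF4 uses Gaussian-quantile levels.
LEVELS = torch.linspace(-1.0, 1.0, 16)


def quantize_block(w: torch.Tensor):
    """Map each weight to the nearest level; keep one absmax scale per block."""
    scale = w.abs().max()
    idx = (w / scale).unsqueeze(-1).sub(LEVELS).abs().argmin(-1)
    return idx.to(torch.uint8), scale  # 4-bit index (stored in uint8) + scale


def dequantize_block(idx, scale, compute_dtype=torch.float32):
    """Rebuilt per-operation in the compute dtype, never stored persistently."""
    return (LEVELS[idx.long()] * scale).to(compute_dtype)


w = torch.randn(8)
idx, scale = quantize_block(w)
w_hat = dequantize_block(idx, scale)  # e.g. h = w_hat @ x inside a layer
```

The reconstruction error is bounded by half the level spacing times the block scale, which is why absmax scaling per small block keeps quantization error low.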
Double Quantization: Each block of 64 weights shares a single FP32 scale factor. Double quantization quantizes these scale factors to FP8, reducing their overhead from 32/64 = 0.5 bits per parameter to approximately 0.127 bits per parameter:
scale_quantized = quantize_fp8(scale_factors)
Memory saved ≈ 0.37 bits/param ≈ 3 GB for a 65B model
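These figures follow directly from the block sizes used in the QLoRA paper: 64 weights per first-level scale factor, and 256 scale factors per second-level FP32 constant.

```python
# Back-of-the-envelope check of the double-quantization savings quoted above.
block_size = 64       # weights sharing one scale factor
dq_block_size = 256   # scale factors sharing one second-level FP32 constant

# Plain: one FP32 scale per 64 weights.
plain_overhead = 32 / block_size                                   # 0.5 bits/param

# Double quant: FP8 scales plus a tiny second-level FP32 constant.
dq_overhead = 8 / block_size + 32 / (block_size * dq_block_size)   # ~0.127

saved_bits = plain_overhead - dq_overhead        # ~0.373 bits/param
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9       # ~3.0 GB for a 65B model
```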
Memory Reduction: For a model with N parameters, memory usage compares as:
| Precision | Bits/Param | Memory for 7B Model |
|---|---|---|
| FP32 | 32 | ~28 GB |
| FP16 / BF16 | 16 | ~14 GB |
| NF4 (data bits only) | 4 | ~3.5 GB |
| NF4 + scale factors | 4.5 | ~3.9 GB |
| NF4 + Double Quant | ~4.13 | ~3.6 GB |
The LoRA adapters (typically 0.1-1% of total parameters) remain in the compute dtype, adding negligible memory overhead.
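As a sanity check, the per-precision footprints can be recomputed from first principles (weight storage only, assuming the 64/256 block sizes from the QLoRA paper; activations, gradients, optimizer state, and the small LoRA adapters are ignored):

```python
def nf4_bits_per_param(double_quant: bool, block: int = 64, dq_block: int = 256) -> float:
    """4 data bits plus the per-block quantization-constant overhead."""
    if double_quant:
        return 4 + 8 / block + 32 / (block * dq_block)
    return 4 + 32 / block


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight-storage footprint in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(7e9, 16)                          # 14.0 GB
nf4_gb = weight_memory_gb(7e9, nf4_bits_per_param(False))    # ~3.9 GB
nf4_dq_gb = weight_memory_gb(7e9, nf4_bits_per_param(True))  # ~3.6 GB
```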