
Principle: LMSYS FastChat Quantized Model Loading

From Leeroopedia


Knowledge Sources
Domains NLP, Training, Quantization
Last Updated 2026-02-07 14:00 GMT

Overview

Loading pretrained causal language models with 4-bit NF4 quantization via BitsAndBytes enables QLoRA fine-tuning at dramatically reduced memory cost while largely preserving model quality.

Description

QLoRA (Quantized Low-Rank Adaptation) is a memory-efficient fine-tuning technique introduced by Dettmers et al. (2023). The core idea is to load the base model weights in a 4-bit quantized format, freeze them entirely, and attach small trainable LoRA adapter layers in higher precision. This allows fine-tuning of models that would otherwise exceed available GPU memory.

The quantized model loading step in FastChat's train_lora.py addresses several critical concerns:

  1. 4-bit NF4 Quantization -- The NormalFloat 4-bit (NF4) data type is an information-theoretically optimal quantization format for normally distributed weights. Unlike uniform quantization, NF4 allocates more representation levels near the center of the weight distribution where density is highest, minimizing quantization error.
  2. Double Quantization -- The quantization constants (scale factors) themselves consume memory. Double quantization applies a second round of quantization to these constants, saving approximately 0.4 bits per parameter (roughly 3 GB for a 65B model) with negligible impact on quality.
  3. Compute Dtype Selection -- While weights are stored in 4-bit format, computation during forward and backward passes occurs in a higher-precision compute dtype. FastChat selects this dynamically: float16 when the --fp16 flag is set, bfloat16 when --bf16 is set, or float32 as fallback. BFloat16 is preferred on Ampere+ GPUs for its larger dynamic range.
  4. Device Mapping for DDP -- When using QLoRA with Distributed Data Parallel (DDP), each process must map the model to its own local GPU. FastChat sets device_map={"": LOCAL_RANK} to route the entire model to the correct device for each process.
  5. Incompatibility Guards -- FSDP (Fully Sharded Data Parallel) and DeepSpeed ZeRO Stage 3 are both currently incompatible with QLoRA because they attempt to shard parameters across devices, which conflicts with the device-local quantized weight storage. FastChat logs a warning when these configurations are detected.
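The five concerns above all converge on a single model-loading call. The following configuration sketch shows how they might map onto the Hugging Face transformers API; the model id is a placeholder and the compute-dtype choice stands in for FastChat's --fp16/--bf16 flag handling, so treat this as an illustration rather than FastChat's exact code.

```python
import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Concern 3: compute dtype mirrors FastChat's selection logic --
# fp16 if --fp16 is set, bf16 if --bf16, else fp32 (choice shown here is an example).
compute_dtype = torch.bfloat16

# Concern 4: under DDP, route the entire model to this process's local GPU.
device_map = {"": int(os.environ.get("LOCAL_RANK", "0"))}

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",  # placeholder model id
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4 bits
        bnb_4bit_quant_type="nf4",             # concern 1: NF4 data type
        bnb_4bit_use_double_quant=True,        # concern 2: quantize the scale factors too
        bnb_4bit_compute_dtype=compute_dtype,  # concern 3: forward/backward precision
    ),
    torch_dtype=compute_dtype,
    device_map=device_map,
)
```

Concern 5 (the FSDP / ZeRO-3 incompatibility guard) is a check performed before this call rather than an argument to it.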

Usage

Use this pattern when:

  • Fine-tuning large language models (7B+ parameters) on GPUs with limited VRAM (e.g., single 24GB GPU).
  • Running QLoRA training where the base model is loaded in 4-bit precision and only LoRA adapters are trained.
  • The training script is invoked with --q_lora True to activate quantized loading.
  • You need to balance memory savings against compute overhead from dequantization.

Do not use this pattern when:

  • Using FSDP or DeepSpeed ZeRO Stage 3, which are incompatible with QLoRA.
  • Full-precision LoRA training is desired (set --q_lora False instead).
  • The model fits comfortably in GPU memory without quantization.
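The incompatibility guard from the list above can be expressed as a small predicate over the trainer configuration. This is an illustrative pure-Python re-creation; the function name and warning messages are assumptions, not FastChat's actual code.

```python
def check_qlora_compatibility(q_lora, use_fsdp=False, zero_stage=None):
    """Return warning strings for trainer configs that conflict with QLoRA.

    Illustrative sketch of the guard described above; names and messages
    are assumptions, not FastChat's actual implementation.
    """
    warnings = []
    if not q_lora:
        return warnings  # guards only apply to quantized (QLoRA) runs
    if use_fsdp:
        warnings.append("FSDP is incompatible with QLoRA (shards parameters across devices).")
    if zero_stage == 3:
        warnings.append("DeepSpeed ZeRO Stage 3 is incompatible with QLoRA (shards parameters).")
    return warnings


# ZeRO Stage 2 only shards optimizer state and gradients, so no warning fires.
assert check_qlora_compatibility(True, zero_stage=2) == []
assert len(check_qlora_compatibility(True, use_fsdp=True, zero_stage=3)) == 2
```

Note that ZeRO Stages 1 and 2 pass the check: only Stage 3 shards the parameters themselves, which is what conflicts with device-local quantized weight storage.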

Theoretical Basis

NF4 Quantization: Given a pretrained weight tensor W with approximately normal distribution, NF4 maps each weight to one of 16 levels optimally spaced for the Gaussian distribution:

W_quantized = quantize_nf4(W)

The 16 quantization levels are computed as the quantiles of the standard normal distribution N(0, 1), ensuring each bin captures approximately equal probability mass. This yields information-theoretically optimal quantization for normally distributed data.
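The quantile construction can be sketched with the standard library alone. This is a simplified version: the actual NF4 levels in bitsandbytes come from an asymmetric construction that pins an exact zero level, whereas this sketch simply takes the midpoint quantile of each of the 16 equal-probability-mass bins and rescales to [-1, 1].

```python
from statistics import NormalDist


def nf4_like_levels(bits=4):
    """Equal-probability-mass quantization levels from N(0, 1) quantiles.

    Simplified sketch: real NF4 pins an exact zero and is asymmetric;
    here each of the 2**bits equal-mass bins contributes its midpoint
    quantile, rescaled so the levels span [-1, 1].
    """
    n = 2 ** bits
    nd = NormalDist()  # standard normal N(0, 1)
    levels = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(v) for v in levels)
    return [v / m for v in levels]


levels = nf4_like_levels()  # 16 strictly increasing levels in [-1, 1]
```

Because the bins carry equal probability mass, the levels cluster near zero, where a normally distributed weight tensor has the most density.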

Dequantization During Compute: During the forward pass, quantized weights are dequantized on-the-fly at the chosen compute precision:

h = dequantize(W_quantized).to(compute_dtype) @ x

The dequantized values are never stored persistently -- they are computed per-operation, keeping the memory footprint at the 4-bit level.
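The quantize/dequantize round trip can be illustrated with a toy absmax scheme over a single block. The uniform levels here are a stand-in for the NF4 codebook, and the function names are illustrative; the point is that dequantized values exist only as temporaries inside the matmul.

```python
def quantize(block, levels):
    """Absmax-quantize a block: store per-weight level indices plus one scale."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
           for w in block]
    return idx, scale


def dequant_matvec(idx, scale, levels, x):
    """Dequantize on the fly inside the dot product; the full-precision
    weights are temporaries and are never stored persistently."""
    return sum(levels[i] * scale * xi for i, xi in zip(idx, x))


levels = [i / 7.5 - 1 for i in range(16)]  # 16 uniform levels in [-1, 1] (NF4 stand-in)
w = [0.3, -0.7, 0.05, 0.9]
idx, scale = quantize(w, levels)
x = [1.0, 2.0, 3.0, 4.0]
approx = dequant_matvec(idx, scale, levels, x)
exact = sum(wi * xi for wi, xi in zip(w, x))
```

Only the 4-bit indices and the block scale persist between operations, which is what keeps the resident memory footprint at roughly the 4-bit level.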

Double Quantization: Each block of 64 weights shares a single FP32 scale factor. Double quantization re-quantizes these scale factors to 8-bit floats (in blocks of 256 scales, each with its own second-level FP32 constant), reducing their overhead from 32/64 = 0.5 bits per parameter to 8/64 + 32/(64 × 256) ≈ 0.127 bits per parameter:

scale_quantized = quantize_fp8(scale_factors)
memory_saved ≈ 0.373 bits/param ≈ 3 GB for a 65B model
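The overhead arithmetic above works out as follows, using the block sizes from the QLoRA paper (64 weights per first-level scale, 256 scales per second-level constant):

```python
# First level: one FP32 scale per block of 64 weights.
plain_overhead = 32 / 64                    # 0.5 bits per parameter

# Double quant: scales stored as 8-bit floats, plus one FP32
# second-level constant per block of 256 first-level scales.
dq_overhead = 8 / 64 + 32 / (64 * 256)      # ~0.127 bits per parameter

saved_bits = plain_overhead - dq_overhead   # ~0.373 bits per parameter
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9  # ~3.0 GB for a 65B-parameter model
```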

Memory Reduction: For a model with N parameters, memory usage compares as:

Precision            Bits/Param   Memory for 7B Model
FP32                 32           ~28 GB
FP16 / BF16          16           ~14 GB
NF4                  4.5          ~3.9 GB
NF4 + Double Quant   ~4.13        ~3.6 GB

(The NF4 rows include the quantization-constant overhead: 0.5 bits/param for plain NF4, ~0.127 bits/param with double quantization.)

The LoRA adapters (typically 0.1-1% of total parameters) remain in the compute dtype, adding negligible memory overhead.
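A small calculator makes the comparison concrete, including the adapter overhead. The 0.5% adapter fraction below is an assumed value within the 0.1-1% range stated above, not a measured one.

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate resident weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N = 7e9  # 7B-parameter model

fp32 = model_memory_gb(N, 32)                                 # ~28 GB
bf16 = model_memory_gb(N, 16)                                 # ~14 GB
nf4_dq = model_memory_gb(N, 4 + 8 / 64 + 32 / (64 * 256))     # ~3.6 GB

# LoRA adapters: assume 0.5% of parameters, kept in the BF16 compute dtype.
adapters = model_memory_gb(0.005 * N, 16)                     # ~0.07 GB
```

Even with the adapters counted, the quantized total stays well under a 24 GB consumer GPU's capacity, which is the scenario the Usage section targets.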

Related Pages

Implemented By
