Principle: Hugging Face Alignment Handbook QLoRA (Quantized Fine-tuning)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A parameter-efficient fine-tuning technique that combines 4-bit quantization of the base model with low-rank adaptation (LoRA) to enable training large language models on consumer-grade GPUs.
Description
QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning of large language models on hardware with limited GPU memory by:
- 4-bit NormalFloat (NF4) quantization: The pretrained model weights are quantized to 4-bit precision using a novel NF4 data type optimized for normally distributed weights
- Double quantization: The quantization constants themselves are quantized to further reduce memory
- Paged optimizers: Optimizer states are offloaded to CPU when GPU memory is scarce
- LoRA adapters: Only small low-rank adapter matrices are trained in full precision while the base model remains frozen in 4-bit
This combination reduces the memory needed to fine-tune a 7B-parameter model from roughly 28GB in FP16 (about 14GB of weights plus 14GB of gradients, before optimizer states) to roughly 5GB (about 3.5GB of 4-bit weights plus the small LoRA adapters), making single-GPU fine-tuning feasible on consumer hardware.
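The arithmetic behind these figures can be checked with a quick back-of-the-envelope calculation (weights and gradients only; optimizer states and activations add more on top):

```python
# Rough memory estimate for a 7B-parameter model (illustrative only)
params = 7e9

fp16_weights_gb = params * 2 / 1e9   # 2 bytes per FP16 parameter
fp16_grads_gb = params * 2 / 1e9     # full fine-tuning also stores FP16 gradients
nf4_weights_gb = params * 0.5 / 1e9  # 4 bits = 0.5 byte per parameter

print(f"Full FP16 fine-tuning: ~{fp16_weights_gb + fp16_grads_gb:.0f} GB (weights + gradients)")
print(f"QLoRA base model:      ~{nf4_weights_gb:.1f} GB (4-bit weights; adapters extra)")
```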
In the alignment-handbook, QLoRA is enabled through YAML config flags (`load_in_4bit: true`, `use_peft: true`); the same training scripts are used for both full fine-tuning and QLoRA by toggling these flags.
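Concretely, enabling QLoRA is a config change rather than a code change. A sketch of the relevant recipe fragment follows; only `load_in_4bit` and `use_peft` are named above, while the LoRA-specific keys are shown as they commonly appear in alignment-handbook recipes and may differ between versions:

```yaml
# QLoRA flags in an alignment-handbook recipe (sketch; key names may vary by version)
load_in_4bit: true
use_peft: true
lora_r: 16
lora_alpha: 16
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
```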
Usage
Use QLoRA when:
- GPU memory is limited (single GPU, consumer-grade hardware)
- Training a 7B+ parameter model on a single GPU
- Near-full fine-tuning quality is acceptable (QLoRA achieves roughly 99% of full fine-tuning performance)
- Fast experimentation with different hyperparameters is desired
Theoretical Basis
QLoRA combines quantization with LoRA. The forward pass of an adapted layer is:

y = W_NF4 · x + (α / r) · B · A · x

Where:
- W_NF4 is the pretrained weight matrix, quantized to 4-bit NormalFloat and kept frozen
- B and A are the trainable LoRA matrices (B ∈ ℝ^(d×r), A ∈ ℝ^(r×k))
- r is the LoRA rank (e.g., 16 for SFT, 128 for DPO)
- α is the LoRA scaling factor
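As a concrete check of this formula, here is a minimal NumPy sketch of one adapted layer. A full-precision W0 stands in for the quantized weights, and the dimensions and init scale are illustrative; note that B starts at zero, as in LoRA, so the adapter is initially a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 16           # hidden size, LoRA rank, scaling factor

W0 = rng.standard_normal((d, d))     # frozen base weights (quantized in real QLoRA)
A = rng.standard_normal((r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init -> adapter starts as no-op

x = rng.standard_normal(d)
h = W0 @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapted output equals the base output
assert np.allclose(h, W0 @ x)

# Trainable params per layer: 2*d*r for the adapter vs d*d for the full matrix
print(2 * d * r / (d * d))  # under 1% of the full-layer parameter count
```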
```python
# Abstract QLoRA flow (simplified sketch, not the handbook's actual scripts)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. Load the base model in 4-bit NF4 (with double quantization)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config
)

# 2. Inject LoRA adapters (the only trainable parameters)
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
# Only LoRA params are trainable; base weights stay frozen in 4-bit

# 3. Train with a higher learning rate (2e-4 vs 2e-5 for full fine-tuning)
train(model, learning_rate=2e-4, optim="paged_adamw_32bit")  # pseudocode
```
Key hyperparameter differences from full fine-tuning:
- Learning rate: 10x higher (2e-4 vs 2e-5) because only LoRA adapters are updated
- Optimizer: Paged AdamW 32-bit for memory efficiency
- LoRA rank: 16 for SFT, 128 for DPO (DPO needs more capacity)
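To see why rank matters for capacity, here is a rough count of trainable adapter parameters at the two ranks above, assuming a hypothetical 7B-style model (hidden size 4096, 32 layers, 7 adapted projections per layer) and treating every projection as square for simplicity; real projection shapes differ, so treat the numbers as order-of-magnitude:

```python
# Rough LoRA adapter size vs. rank (illustrative; real projection shapes differ)
d, layers, mods = 4096, 32, 7  # hidden size, layer count, adapted projections/layer

def lora_params(r):
    # each adapted module adds A (r x d) and B (d x r) -> 2*d*r params
    return 2 * d * r * layers * mods

print(f"r=16  (SFT): {lora_params(16) / 1e6:.1f}M trainable params")
print(f"r=128 (DPO): {lora_params(128) / 1e6:.1f}M trainable params")
```

Adapter size scales linearly with r, so moving from rank 16 to rank 128 gives the DPO adapters 8x the capacity while the frozen 4-bit base model stays the same size.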