Principle: Hugging Face Alignment Handbook QLoRA (Quantized Fine-tuning)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A parameter-efficient fine-tuning technique that combines 4-bit quantization of the base model with low-rank adaptation (LoRA) to enable training large language models on consumer-grade GPUs.
Description
QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning of large language models on hardware with limited GPU memory by:
- 4-bit NormalFloat (NF4) quantization: The pretrained model weights are quantized to 4-bit precision using a novel NF4 data type optimized for normally distributed weights
- Double quantization: The quantization constants themselves are quantized to further reduce memory
- Paged optimizers: Optimizer states are offloaded to CPU when GPU memory is scarce
- LoRA adapters: Only small low-rank adapter matrices are trained in full precision while the base model remains frozen in 4-bit
This combination reduces the memory needed to fine-tune a 7B-parameter model from roughly 28GB in FP16 (about 14GB of weights plus 14GB of gradients, before optimizer states) to roughly 5GB (about 3.5GB of 4-bit weights plus the small LoRA adapters), making single-GPU fine-tuning feasible on consumer hardware.
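The arithmetic behind these figures can be checked with a quick back-of-the-envelope calculation (weights and gradients only; optimizer states and activations add more on top):

```python
# Rough memory estimate for a 7B-parameter model (illustrative only)
params = 7e9

fp16_weights_gb = params * 2 / 1e9   # 2 bytes per FP16 parameter
fp16_grads_gb = params * 2 / 1e9     # full fine-tuning also stores FP16 gradients
nf4_weights_gb = params * 0.5 / 1e9  # 4 bits = 0.5 byte per parameter

print(f"Full FP16 fine-tuning: ~{fp16_weights_gb + fp16_grads_gb:.0f} GB (weights + gradients)")
print(f"QLoRA base model:      ~{nf4_weights_gb:.1f} GB (4-bit weights; adapters extra)")
```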
In the alignment-handbook, QLoRA is enabled through YAML config flags (`load_in_4bit: true`, `use_peft: true`); the same training scripts are used for both full fine-tuning and QLoRA by toggling these flags.
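Concretely, enabling QLoRA is a config change rather than a code change. A sketch of the relevant recipe fragment follows; only `load_in_4bit` and `use_peft` are named above, while the LoRA-specific keys are shown as they commonly appear in alignment-handbook recipes and may differ between versions:

```yaml
# QLoRA flags in an alignment-handbook recipe (sketch; key names may vary by version)
load_in_4bit: true
use_peft: true
lora_r: 16
lora_alpha: 16
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
```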
Usage
Use QLoRA when:
- GPU memory is limited (single GPU, consumer-grade hardware)
- Training a 7B+ parameter model on a single GPU
- Near-full fine-tuning quality is acceptable (QLoRA achieves roughly 99% of full fine-tuning performance)
- Fast experimentation with different hyperparameters is desired
Theoretical Basis
QLoRA combines quantization with LoRA. The forward pass of an adapted layer is:

y = W_NF4 · x + (α / r) · B · A · x

Where:
- W_NF4 is the pretrained weight matrix, quantized to 4-bit NormalFloat and kept frozen
- B and A are the trainable LoRA matrices (B ∈ ℝ^(d×r), A ∈ ℝ^(r×k))
- r is the LoRA rank (e.g., 16 for SFT, 128 for DPO)
- α is the LoRA scaling factor
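As a concrete check of this formula, here is a minimal NumPy sketch of one adapted layer. A full-precision W0 stands in for the quantized weights, and the dimensions and init scale are illustrative; note that B starts at zero, as in LoRA, so the adapter is initially a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 16           # hidden size, LoRA rank, scaling factor

W0 = rng.standard_normal((d, d))     # frozen base weights (quantized in real QLoRA)
A = rng.standard_normal((r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init -> adapter starts as no-op

x = rng.standard_normal(d)
h = W0 @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapted output equals the base output
assert np.allclose(h, W0 @ x)

# Trainable params per layer: 2*d*r for the adapter vs d*d for the full matrix
print(2 * d * r / (d * d))  # under 1% of the full-layer parameter count
```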
```python
# Abstract QLoRA flow (simplified sketch, not the handbook's actual scripts)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. Load the base model in 4-bit NF4 (with double quantization)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config
)

# 2. Inject LoRA adapters (the only trainable parameters)
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
# Only LoRA params are trainable; base weights stay frozen in 4-bit

# 3. Train with a higher learning rate (2e-4 vs 2e-5 for full fine-tuning)
train(model, learning_rate=2e-4, optim="paged_adamw_32bit")  # pseudocode
```
Key hyperparameter differences from full fine-tuning:
- Learning rate: 10x higher (2e-4 vs 2e-5) because only LoRA adapters are updated
- Optimizer: Paged AdamW 32-bit for memory efficiency
- LoRA rank: 16 for SFT, 128 for DPO (DPO needs more capacity)
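To see why rank matters for capacity, here is a rough count of trainable adapter parameters at the two ranks above, assuming a hypothetical 7B-style model (hidden size 4096, 32 layers, 7 adapted projections per layer) and treating every projection as square for simplicity; real projection shapes differ, so treat the numbers as order-of-magnitude:

```python
# Rough LoRA adapter size vs. rank (illustrative; real projection shapes differ)
d, layers, mods = 4096, 32, 7  # hidden size, layer count, adapted projections/layer

def lora_params(r):
    # each adapted module adds A (r x d) and B (d x r) -> 2*d*r params
    return 2 * d * r * layers * mods

print(f"r=16  (SFT): {lora_params(16) / 1e6:.1f}M trainable params")
print(f"r=128 (DPO): {lora_params(128) / 1e6:.1f}M trainable params")
```

Adapter size scales linearly with r, so moving from rank 16 to rank 128 gives the DPO adapters 8x the capacity while the frozen 4-bit base model stays the same size.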