

Principle:Huggingface Alignment handbook QLoRA Quantized Finetuning

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Optimization
Last Updated 2026-02-07 00:00 GMT

Overview

QLoRA is a parameter-efficient fine-tuning technique that combines 4-bit quantization of the frozen base model with low-rank adaptation (LoRA), enabling fine-tuning of large language models on consumer-grade GPUs.

Description

QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning of large language models on hardware with limited GPU memory by:

  1. 4-bit NormalFloat (NF4) quantization: The pretrained model weights are quantized to 4-bit precision using a novel NF4 data type optimized for normally distributed weights
  2. Double quantization: The quantization constants themselves are quantized to further reduce memory
  3. Paged optimizers: Optimizer states are kept in paged (NVIDIA unified) memory and automatically spill to CPU RAM during GPU memory spikes
  4. LoRA adapters: Only small low-rank adapter matrices are trained in full precision while the base model remains frozen in 4-bit
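Step 1 can be sketched numerically. NF4 stores one of 16 fixed code values per weight, chosen as quantiles of a standard normal rescaled to [-1, 1], with one absmax scale per block. The following is a simplified NumPy sketch, not the real bitsandbytes kernel (which packs two codes per byte and runs on GPU):

```python
import numpy as np

# The 16 NF4 code values (quantiles of N(0, 1) rescaled to [-1, 1]),
# as published for the NF4 data type.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(weights, block_size=64):
    """Blockwise absmax NF4 quantization (weights.size must divide evenly)."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)  # one FP constant per block
    normalized = w / scales                        # values now in [-1, 1]
    # Map each value to the index of the nearest NF4 level (0..15)
    codes = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return codes.astype(np.uint8), scales

def nf4_dequantize(codes, scales, shape):
    """Recover approximate weights: code value times per-block scale."""
    return (NF4_LEVELS[codes] * scales).reshape(shape)
```

Double quantization (step 2) would additionally quantize the per-block `scales` array, which is what the sketch stores in full precision here.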

This combination reduces the memory footprint of a 7B parameter model from ~28GB (FP16) to ~5GB (4-bit + LoRA), making single-GPU fine-tuning feasible on consumer hardware.
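The quoted figures can be sanity-checked with back-of-envelope arithmetic. The interpretation below (FP16 weights plus same-size gradients for the ~28GB figure, 4-bit weights plus adapter and quantization-constant overhead for the ~5GB figure) is an assumption, not stated on this page:

```python
# Back-of-envelope memory estimate for a 7B-parameter model (rough, not measured)
PARAMS = 7e9
GIB = 2**30

fp16_weights = PARAMS * 2 / GIB    # 2 bytes/param -> ~13 GiB of weights alone
fp16_train = fp16_weights * 2      # + FP16 gradients -> ~26 GiB (near ~28GB quoted)
nf4_weights = PARAMS * 0.5 / GIB   # 4 bits/param -> ~3.3 GiB; adapters and
                                   # quantization constants bring this toward ~5GB

print(f"FP16 train: ~{fp16_train:.0f} GiB, NF4 weights: ~{nf4_weights:.1f} GiB")
```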

In the alignment-handbook, QLoRA is enabled through YAML config flags (load_in_4bit: true, use_peft: true); the same training scripts serve both full fine-tuning and QLoRA, with these flags toggling between the two.
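A minimal recipe fragment might look like the following. Only load_in_4bit and use_peft are taken from this page; the remaining keys are assumptions modeled on typical alignment-handbook recipes:

```yaml
# Illustrative QLoRA recipe fragment (keys other than load_in_4bit and
# use_peft are assumed, not quoted from this page)
model_name_or_path: mistralai/Mistral-7B-v0.1   # example model, assumed
load_in_4bit: true        # quantize base weights to 4-bit NF4
use_peft: true            # train LoRA adapters instead of full weights
lora_r: 16                # LoRA rank (16 for SFT per this page)
learning_rate: 2.0e-04    # ~10x the full fine-tuning rate
optim: paged_adamw_32bit  # paged optimizer for memory headroom
```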

Usage

Use QLoRA when:

  • GPU memory is limited (single GPU, consumer-grade hardware)
  • Training a 7B+ parameter model on a single GPU
  • Near-full fine-tuning quality is acceptable (QLoRA achieves ~99% of full fine-tuning performance)
  • Fast experimentation with different hyperparameters is desired

Theoretical Basis

QLoRA combines quantization with LoRA:

W_effective = Q_NF4(W_pretrained) + (α/r)·BA

Where:

  • Q_NF4(·) quantizes the pretrained weights to 4-bit NormalFloat
  • B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the trainable LoRA matrices
  • r is the LoRA rank (e.g., 16 for SFT, 128 for DPO)
  • α is the LoRA scaling factor (the adapter contribution is scaled by α/r)

# Abstract QLoRA flow (NOT the real alignment-handbook implementation)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. Load the base model in 4-bit NF4
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config
)

# 2. Inject LoRA adapters (the only trainable parameters)
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
# Base weights stay frozen in 4-bit; only LoRA adapters receive gradients

# 3. Train with a higher learning rate (2e-4 vs 2e-5 for full fine-tuning)
train(model, learning_rate=2e-4, optim="paged_adamw_32bit")  # abstract step

Key hyperparameter differences from full fine-tuning:

  • Learning rate: 10x higher (2e-4 vs 2e-5) because only LoRA adapters are updated
  • Optimizer: Paged AdamW 32-bit for memory efficiency
  • LoRA rank: 16 for SFT, 128 for DPO (DPO needs more capacity)
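
The rank choice above translates directly into trainable-parameter counts. A rough sketch, assuming Llama-7B-like projection shapes (hidden 4096, intermediate 11008, 32 layers; these dimensions are assumed, not stated on this page):

```python
# Trainable-parameter count for LoRA on a Llama-7B-like model
# (hidden=4096, intermediate=11008, 32 layers -- assumed dimensions)
hidden, inter, layers = 4096, 11008, 32

# (in_features, out_features) of the seven targeted projections per layer
proj_shapes = [
    (hidden, hidden),  # q_proj
    (hidden, hidden),  # k_proj
    (hidden, hidden),  # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
]

def lora_params(r):
    # Each adapted weight (d_in x d_out) gains A (r x d_in) and B (d_out x r),
    # i.e. r * (d_in + d_out) extra trainable parameters
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in proj_shapes)
    return per_layer * layers

for r in (16, 128):
    print(f"r={r}: {lora_params(r) / 1e6:.1f}M trainable params")
```

At r=16 this is roughly 40M trainable parameters, well under 1% of the 7B base model, which is why the learning rate can be raised without destabilizing training.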

