Principle:Hiyouga LLaMA Factory Model Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model Compression, Deep Learning, Efficient Inference |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
A model compression technique that reduces the memory footprint and computational cost of neural networks by representing weights and/or activations with lower-precision numerical formats.
Description
Model quantization is the process of converting model parameters from high-precision floating-point representations (typically FP16 or BF16) to lower-precision formats such as 8-bit integers, 4-bit integers, or specialized floating-point formats. This dramatically reduces memory consumption and can accelerate inference, making it possible to run large language models on hardware with limited resources.
Quantization is essential in the LLM ecosystem because modern language models often exceed the memory capacity of individual GPUs. A 70-billion-parameter model in FP16 requires approximately 140 GB of memory for its weights alone (2 bytes per parameter). 4-bit quantization reduces this to approximately 35 GB, which fits on a single 40 GB or 48 GB GPU but still exceeds the 24 GB typical of high-end consumer cards.
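The weight-only arithmetic behind these figures can be checked with a short sketch. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores activations, KV cache, and quantization constants:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


# 70 billion parameters at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real memory use is higher than these numbers, since scale factors and runtime buffers add to the weight storage.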
LLaMA-Factory supports two broad categories of quantization:
Post-Training Quantization (PTQ) quantizes an already-trained model without additional training. Methods include:
- GPTQ (GPT Quantization): A one-shot weight quantization method that uses second-order information (approximate Hessian) to minimize quantization error layer by layer.
- AWQ (Activation-Aware Weight Quantization): Identifies salient weight channels based on activation magnitudes and protects them during quantization.
- AQLM (Additive Quantization of Language Models): Uses additive codebook quantization for extreme compression (2-bit).
- HQQ (Half-Quadratic Quantization): A fast quantization method based on half-quadratic optimization.
- EETQ (Easy and Efficient Transformer Quantization): INT8 quantization designed for efficient transformer inference.
On-the-Fly Quantization (OTF) applies quantization during model loading for training:
- BitsAndBytes (BnB): The primary quantization backend for QLoRA training, supporting both 8-bit and 4-bit quantization with NF4 (Normal Float 4) and FP4 data types. 4-bit NF4 with double quantization is the standard QLoRA configuration.
Export Quantization uses GPTQ to quantize a model for deployment via calibration on a representative dataset.
Usage
Use quantization when you want to:
- Reduce GPU memory requirements for inference or fine-tuning.
- Fine-tune large models on consumer hardware using QLoRA (4-bit quantization + LoRA).
- Deploy models on resource-constrained devices.
- Export quantized models for efficient serving with frameworks like vLLM or TGI.
- Balance model quality against computational cost.
Quantization is generally recommended when GPU memory is the bottleneck. The quality-efficiency tradeoff depends on the quantization method and bit width: lower bit widths save more memory but lose more accuracy. 4-bit NF4 is the usual starting point for QLoRA fine-tuning.
Theoretical Basis
Uniform Quantization
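The basic principle of quantization maps a continuous range of floating-point values to a discrete set of levels. For symmetric uniform quantization with $b$ bits:
$$Q(w) = s \cdot \mathrm{round}\left(\frac{w}{s}\right)$$
where $s$ is the scale factor computed as $s = \frac{\max |w|}{2^{b-1} - 1}$ and $\mathrm{round}(\cdot)$ denotes rounding to the nearest integer.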
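A minimal pure-Python sketch of symmetric uniform quantization follows. The function names are hypothetical, a single scale is shared by the whole list (per-tensor rather than per-block), and Python's built-in `round` resolves exact ties to the nearest even integer:

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
print(codes, restored)
```

With round-to-nearest, the reconstruction error of each weight is at most half the scale factor.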
NF4 (Normal Float 4-bit)
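NF4, introduced in QLoRA, is a quantization data type specifically designed for normally distributed neural network weights. It spaces the quantization levels non-uniformly to match the empirical distribution of pretrained weights. In idealized form, for $k$ bits:
$$q_i = \Phi^{-1}\left(\frac{i + 1/2}{2^k}\right), \quad i = 0, \dots, 2^k - 1$$
where $\Phi^{-1}$ is the quantile function of the standard normal distribution, so the quantization levels are chosen so that each level covers an equal probability mass under the normal distribution. The levels are then rescaled to the range $[-1, 1]$. This information-theoretically optimal quantile quantization assigns an equal expected number of weights to each level when the weights are Gaussian-distributed.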
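The equal-probability-mass idea can be sketched with the standard library. This is an idealized construction under the assumption of midpoint quantiles, not the exact NF4 table: the table used in practice is built slightly differently so that zero is exactly representable. The function name is hypothetical:

```python
from statistics import NormalDist


def normal_quantile_levels(bits):
    """Levels at the midpoint quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()  # mean 0, standard deviation 1
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    max_abs = max(abs(v) for v in raw)
    return [v / max_abs for v in raw]


levels = normal_quantile_levels(4)
print([round(v, 4) for v in levels])
```

The resulting levels are dense near zero, where most weights lie, and sparse in the tails.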
Double Quantization
Double quantization (DQ) reduces the memory overhead of the quantization constants themselves. In standard blockwise quantization, each block of 64 weights shares an FP32 scale factor, adding 32/64 = 0.5 bits per parameter. Double quantization quantizes these scale factors to FP8, reducing the overhead to approximately 0.127 bits per parameter:
$$\text{overhead} = \frac{8}{B_1} + \frac{32}{B_1 \cdot B_2}$$
where $B_1$ is the first-level block size and $B_2$ is the second-level block size. With $B_1 = 64$ and $B_2 = 256$, this gives $8/64 + 32/(64 \cdot 256) \approx 0.127$.
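The overhead figures can be reproduced with a few lines of arithmetic. The function names are hypothetical, and the second-level block size of 256 is an assumption taken from the QLoRA paper's setup rather than from this page:

```python
def single_level_overhead(block_size=64, scale_bits=32):
    """Bits per parameter spent on one scale factor per block."""
    return scale_bits / block_size


def double_quant_overhead(block1=64, block2=256, scale_bits=8, outer_bits=32):
    """8-bit first-level scales plus one 32-bit constant per block2 scales."""
    return scale_bits / block1 + outer_bits / (block1 * block2)


print(single_level_overhead())   # plain blockwise quantization
print(double_quant_overhead())   # with double quantization
```

The saving is the difference between the two values, a little under 0.4 bits per parameter.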
GPTQ Layer-wise Quantization
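GPTQ quantizes weights one layer at a time by solving:
$$\arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2$$
where $X$ is the layer input computed from a calibration dataset, $W$ is the original weight, and $\hat{W}$ is the quantized weight. It uses the inverse Hessian of this objective, $H^{-1}$ with $H = 2 X X^\top$, for error compensation: after each column of $W$ is quantized, the remaining unquantized columns are updated to absorb the resulting error. Columns are processed in a fixed order, optionally sorted by decreasing activation magnitude (the "act-order" heuristic), rather than in a greedily chosen order.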
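As a sketch of the error-compensation step, the update rule from the GPTQ paper can be written as follows. The symbols $F$, $q$, $w_q$, $\delta_F$, and $H_F$ are introduced here for illustration and are not defined elsewhere on this page:

$$\delta_F = -\frac{w_q - \mathrm{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}, \qquad H_F = 2 X_F X_F^\top$$

Here $F$ is the set of weights in a row that are not yet quantized, $w_q$ is the weight currently being quantized, $\mathrm{quant}(\cdot)$ rounds it to the nearest grid value, and $\delta_F$ is the correction added to the remaining weights in $F$.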
AWQ Salient Channel Protection
AWQ observes that a small fraction of weight channels disproportionately affect model output. It identifies salient channels based on activation magnitudes and applies per-channel scaling before quantization:
$$\hat{W} = Q\left(W \cdot \mathrm{diag}(s)\right) \cdot \mathrm{diag}(s)^{-1}$$
where $Q(\cdot)$ is the quantization function and $s$ is the per-channel scaling vector, chosen to minimize quantization error on the salient channels, effectively trading off precision between important and less important channels.
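A toy example shows why scaling a salient channel helps. Everything here is an illustrative assumption: the weights, activations, and scales are hand-picked, and the scale vector `s` is fixed by hand, whereas AWQ itself searches for scales using activation statistics from calibration data:

```python
def fake_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


w = [0.4, 7.0]   # one weight row; channel 0 has a small weight
x = [10.0, 1.0]  # channel 0 sees large activations, so it is salient
s = [2.0, 1.0]   # hypothetical per-channel scales

y_ref = dot(w, x)                    # full-precision output
y_plain = dot(fake_quantize(w), x)   # quantize W directly

# Scale weights up, quantize, and fold the inverse scale into the input.
w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]
y_awq = dot(fake_quantize(w_scaled), x_scaled)

print(abs(y_plain - y_ref), abs(y_awq - y_ref))
```

Without scaling, the small weight on the salient channel rounds to zero. Scaling it up first lets it survive rounding, and dividing the activation by the same factor keeps the full-precision output unchanged.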
QLoRA: Quantization + LoRA
QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models. The base model weights are stored in 4-bit NF4 format, while the LoRA adapter weights (the $A$ and $B$ matrices) are maintained in FP32/BF16 for training. During the forward pass, quantized weights are dequantized to BF16 on the fly, enabling gradient computation through the LoRA path while keeping memory usage minimal.
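The forward pass described above can be sketched in pure Python. This is a simplified illustration under several assumptions: the base weights use plain integer codes with one shared scale instead of the NF4 lookup table, Python floats stand in for BF16, the `alpha / rank` scaling follows the usual LoRA convention, and all names are hypothetical:

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def qlora_forward(x, codes, scale, lora_a, lora_b, alpha, rank):
    """y = dequant(W_q) x + (alpha / rank) * B (A x)."""
    # Frozen base path: dequantize the stored codes on the fly.
    w = [[c * scale for c in row] for row in codes]
    base = matvec(w, x)
    # Trainable low-rank path, kept in higher precision.
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


x = [1.0, 2.0, -1.0]
codes = [[2, -1, 0], [4, 3, 1]]   # quantized base weights (2 x 3)
lora_a = [[1.0, 1.0, 0.0]]        # rank x in_features
lora_b_zero = [[0.0], [0.0]]      # B initialized to zero: adapter is a no-op
lora_b = [[0.5], [-1.0]]          # out_features x rank

y_init = qlora_forward(x, codes, 0.5, lora_a, lora_b_zero, alpha=2, rank=1)
y_tuned = qlora_forward(x, codes, 0.5, lora_a, lora_b, alpha=2, rank=1)
print(y_init, y_tuned)
```

Only `lora_a` and `lora_b` would receive gradient updates during training. The integer codes and their scale stay fixed, which is what keeps the memory footprint small.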