Principle:Hiyouga LLaMA Factory Model Quantization

From Leeroopedia
Revision as of 18:17, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Hiyouga_LLaMA_Factory_Model_Quantization.md)


Knowledge Sources
Domains: Model Compression, Deep Learning, Efficient Inference
Last Updated: 2026-02-06 19:00 GMT

Overview

A model compression technique that reduces the memory footprint and computational cost of neural networks by representing weights and/or activations with lower-precision numerical formats.

Description

Model quantization is the process of converting model parameters from high-precision floating-point representations (typically FP16 or BF16) to lower-precision formats such as 8-bit integers, 4-bit integers, or specialized floating-point formats. This dramatically reduces memory consumption and can accelerate inference, making it possible to run large language models on hardware with limited resources.

Quantization is essential in the LLM ecosystem because modern language models often exceed the memory capacity of individual GPUs. The weights of a 70-billion-parameter model in FP16 occupy approximately 140 GB (2 bytes per parameter), while 4-bit quantization reduces this to approximately 35 GB (0.5 bytes per parameter). That fits on a single 48 GB or 80 GB GPU, but is still above the 24 GB found on typical high-end consumer cards.
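
The arithmetic behind these figures can be checked with a few lines of Python. This is a minimal sketch that counts weight storage only; the helper name `weight_memory_gb` is made up for illustration, and activations, KV cache, and quantization constants are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# 70 billion parameters at three common precisions.
for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real deployments need headroom on top of these numbers, so the result is best read as a lower bound.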

LLaMA-Factory supports two broad categories of quantization:

Post-Training Quantization (PTQ) quantizes an already-trained model without additional training. Methods include:

  • GPTQ (GPT Quantization): A one-shot weight quantization method that uses second-order information (approximate Hessian) to minimize quantization error layer by layer.
  • AWQ (Activation-Aware Weight Quantization): Identifies salient weight channels based on activation magnitudes and protects them during quantization.
  • AQLM (Additive Quantization of Language Models): Uses additive codebook quantization for extreme compression (2-bit).
  • HQQ (Half-Quadratic Quantization): A fast quantization method based on half-quadratic optimization.
  • EETQ (Easy and Efficient Transformer Quantization): INT8 quantization designed for efficient transformer inference.

On-the-Fly Quantization (OTF) applies quantization during model loading for training:

  • BitsAndBytes (BnB): The primary quantization backend for QLoRA training, supporting both 8-bit and 4-bit quantization with NF4 (Normal Float 4) and FP4 data types. 4-bit NF4 with double quantization is the standard QLoRA configuration.

Export Quantization uses GPTQ, calibrated on a representative dataset, to quantize a model for deployment.

Usage

Use quantization when you want to:

  • Reduce GPU memory requirements for inference or fine-tuning.
  • Fine-tune large models on consumer hardware using QLoRA (4-bit quantization + LoRA).
  • Deploy models on resource-constrained devices.
  • Export quantized models for efficient serving with frameworks like vLLM or TGI.
  • Balance model quality against computational cost.

Quantization is generally recommended when GPU memory is a bottleneck. The quality-efficiency tradeoff depends on the quantization method and bit width, with 4-bit NF4 offering an excellent balance for most use cases.

Theoretical Basis

Uniform Quantization

The basic principle of quantization maps a continuous range of floating-point values to a discrete set of levels. For symmetric uniform quantization with b bits:

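$$Q(w) = \mathrm{clamp}\left(\left\lfloor \frac{w}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right)$$

where $s$ is the scale factor, computed as $s = \frac{\max(|w|)}{2^{b-1}-1}$, and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.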
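
As a rough illustration of the formula, the following sketch quantizes a short list of weights to signed 4-bit codes and maps them back. The function names are hypothetical, one scale is shared by the whole list, and at least one weight is assumed to be non-zero.

```python
def quantize_symmetric(weights, bits):
    """Symmetric uniform quantization with a single shared scale factor."""
    qmax = 2 ** (bits - 1) - 1           # largest code, e.g. 7 for 4 bits
    qmin = -(2 ** (bits - 1))            # smallest code, e.g. -8 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    # Python's round() sends exact .5 ties to the nearest even integer.
    codes = [min(max(round(w / scale), qmin), qmax) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [q * scale for q in codes]


weights = [0.8, -0.5, 0.1, -1.2]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
print(codes)      # integer codes in [-8, 7]
print(restored)   # each value is within scale / 2 of the original
```

Because the scale is derived from the largest magnitude, a single outlier weight widens the step for every other weight in the list, which is one motivation for the blockwise schemes discussed later.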

NF4 (Normal Float 4-bit)

NF4, introduced in QLoRA, is a quantization data type specifically designed for normally-distributed neural network weights. It divides the quantization levels non-uniformly to match the empirical distribution of pretrained weights:

$$\text{NF4 levels} = \left\{ q_i : P(w \le q_i) = \frac{i}{2^b} \right\}$$

where the quantization levels $q_i$ are chosen so that each level covers an equal probability mass under the normal distribution. The QLoRA paper describes this quantile quantization as information-theoretically optimal for normally distributed weights, because every level is used equally often.
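
The idea of equal-probability levels can be sketched with the standard library's normal distribution. This is a simplified illustration, not the exact NF4 table shipped with bitsandbytes: it places one level at the midpoint quantile of each of the 2^b equal-mass bins and rescales to [-1, 1], whereas the real NF4 construction is asymmetric and includes an exact zero. The function names are hypothetical.

```python
from statistics import NormalDist


def quantile_levels(bits):
    """Equal-probability levels of a standard normal, rescaled to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()  # mean 0, standard deviation 1
    # Midpoint quantile of each of the n equal-mass bins.
    raw = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in raw)
    return [q / largest for q in raw]


def nearest_level(value, levels):
    """Index of the level closest to a weight already normalized to [-1, 1]."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))


levels = quantile_levels(4)
print([round(q, 3) for q in levels])
```

The printed levels are packed densely around zero and spread out toward the tails, which mirrors where most pretrained weights lie.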

Double Quantization

Double quantization (DQ) reduces the memory overhead of the quantization constants themselves. In standard blockwise quantization, each block of 64 weights shares an FP32 scale factor, adding 32/64 = 0.5 bits per parameter. Double quantization quantizes these scale factors to 8-bit floats in second-level blocks of 256, each with its own FP32 constant, reducing the overhead to approximately 0.127 bits per parameter:

$$\text{Memory per param} = b + \frac{32}{B_1} \;\rightarrow\; b + \frac{8}{B_1} + \frac{32}{B_1 B_2}$$

where $B_1$ is the first-level block size (64 in QLoRA) and $B_2$ is the second-level block size (256 in QLoRA).
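
The overhead figures follow directly from the block sizes. The sketch below assumes the QLoRA settings of 64 weights per first-level block and 256 scale factors per second-level block; the function name and its parameters are illustrative only.

```python
def overhead_bits_per_param(block1, const_bits=32, block2=None, second_const_bits=32):
    """Extra bits per weight spent on quantization constants."""
    if block2 is None:
        # One constant of `const_bits` bits per block of `block1` weights.
        return const_bits / block1
    # Quantized first-level constants plus one second-level constant
    # shared by `block2` first-level blocks.
    return const_bits / block1 + second_const_bits / (block1 * block2)


single = overhead_bits_per_param(64)                             # 32/64
double = overhead_bits_per_param(64, const_bits=8, block2=256)   # 8/64 + 32/(64*256)
print(f"single: {single:.3f} bits/param, double: {double:.3f} bits/param")
print(f"saving: {single - double:.3f} bits/param")
```

For a 4-bit model this moves the effective cost from about 4.5 to about 4.127 bits per parameter.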

GPTQ Layer-wise Quantization

GPTQ quantizes weights one layer at a time by solving:

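$$\hat{W} = \arg\min_{\hat{W}} \left\lVert WX - \hat{W}X \right\rVert_2^2$$

where $X$ is the layer input collected from a calibration dataset, $W$ is the original weight matrix, and $\hat{W}$ is the quantized weight matrix. GPTQ uses the inverse Hessian $H^{-1} = (2XX^\top)^{-1}$ for error compensation: after a column is quantized, the not-yet-quantized columns are updated to absorb the resulting error. Columns are processed in a fixed order shared by all rows, optionally reordered by the Hessian diagonal (the "act-order" option), rather than in a greedily chosen per-row order.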
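
The effect of Hessian-based error compensation can be seen on a toy problem. The sketch below is not the GPTQ implementation (which works on full matrices in blocks with a Cholesky factorization); it handles a single weight row with two columns, rounds to the integer grid, and compares plain round-to-nearest against quantizing the first column and pushing its error onto the second. All names and the sample data are invented for illustration.

```python
def layer_error(w, w_hat, samples):
    """Sum over calibration samples of the squared output difference."""
    return sum(
        (sum(a * x for a, x in zip(w, s)) - sum(a * x for a, x in zip(w_hat, s))) ** 2
        for s in samples
    )


def hessian_2x2(samples):
    """H = 2 * X X^T for two input features."""
    h = [[0.0, 0.0], [0.0, 0.0]]
    for x in samples:
        for i in range(2):
            for j in range(2):
                h[i][j] += 2.0 * x[i] * x[j]
    return h


def invert_2x2(h):
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [[h[1][1] / det, -h[0][1] / det], [-h[1][0] / det, h[0][0] / det]]


def quantize_with_compensation(w, samples):
    """Quantize column 0, shift its error onto column 1, then quantize column 1."""
    h_inv = invert_2x2(hessian_2x2(samples))
    q0 = round(w[0])
    err = w[0] - q0
    w1_updated = w[1] - err * h_inv[1][0] / h_inv[0][0]
    return [q0, round(w1_updated)]


samples = [(1.0, 1.0), (1.0, 1.0), (1.0, 0.0)]   # correlated input features
w = [0.4, 0.3]
rtn = [round(v) for v in w]                      # round-to-nearest baseline
compensated = quantize_with_compensation(w, samples)
print("round-to-nearest:", rtn, layer_error(w, rtn, samples))
print("compensated:     ", compensated, layer_error(w, compensated, samples))
```

Because the two input features are correlated, rounding the second weight up partly cancels the error introduced by rounding the first weight down, so the compensated row has the lower output error on these samples.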

AWQ Salient Channel Protection

AWQ observes that a small fraction of weight channels disproportionately affect model output. It identifies salient channels based on activation magnitudes and applies per-channel scaling before quantization:

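$$Q(w \cdot s) \cdot \frac{x}{s} \approx w \cdot x$$

where $s$ is a per-channel scaling factor (not the quantization scale factor from the uniform quantization formula), chosen to minimize quantization error on the salient channels, effectively trading off precision between important and less important channels.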
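
A small numerical sketch shows why scaling a salient channel helps. It assumes one group of four weights sharing a 4-bit step, a large activation on channel 0, and a hand-picked scale of 2 for that channel. AWQ itself searches for the scales from activation statistics, so the fixed value here is purely illustrative, as are the function names.

```python
def quantize_group(weights, bits=4):
    """Round a group of weights to a shared symmetric grid and dequantize."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [min(max(round(w / step), -qmax - 1), qmax) * step for w in weights]


def quantized_dot(weights, acts, scales):
    """Compute sum_i Q(w_i * s_i) * (x_i / s_i)."""
    scaled = [w * s for w, s in zip(weights, scales)]
    w_q = quantize_group(scaled)
    return sum(q * (x / s) for q, x, s in zip(w_q, acts, scales))


weights = [0.35, 1.0, -0.72, 0.05]
acts = [8.0, 0.1, 0.1, 0.1]          # channel 0 carries a large activation
exact = sum(w * x for w, x in zip(weights, acts))

plain_err = abs(exact - quantized_dot(weights, acts, [1.0, 1.0, 1.0, 1.0]))
# Scaling channel 0 by 2 keeps the group maximum (1.0) unchanged, so the
# step size is the same but the rounding error on channel 0 is divided by 2
# again when the activation is divided by the scale.
awq_err = abs(exact - quantized_dot(weights, acts, [2.0, 1.0, 1.0, 1.0]))
print(f"output error without scaling: {plain_err:.4f}")
print(f"output error with scaling:    {awq_err:.4f}")
```

The benefit depends on the scaled weight not becoming the new group maximum; if it did, the step size would grow and the other channels would lose precision, which is the tradeoff the scale search has to balance.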

QLoRA: Quantization + LoRA

QLoRA combines 4-bit quantization with LoRA to enable fine-tuning of very large models. The base model weights are stored in 4-bit NF4 format, while the LoRA adapter weights (A and B matrices) are maintained in FP32/BF16 for training. During the forward pass, quantized weights are dequantized to BF16 on the fly, enabling gradient computation through the LoRA path while keeping memory usage minimal.
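
The forward pass described above can be sketched in plain Python. This is a simplified illustration under several assumptions: uniform signed 4-bit codes stand in for NF4, a single scale covers the whole matrix, ordinary Python floats stand in for BF16, and the function names and the `alpha_over_r` parameter are hypothetical.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def quantize_matrix(matrix, bits=4):
    """Store a frozen base weight as integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for row in matrix for v in row) / qmax
    codes = [[min(max(round(v / scale), -qmax - 1), qmax) for v in row] for row in matrix]
    return codes, scale


def qlora_forward(codes, scale, lora_a, lora_b, x, alpha_over_r=1.0):
    """y = dequant(W_q) x + (alpha / r) * B (A x)."""
    base_weight = [[q * scale for q in row] for row in codes]   # dequantize on the fly
    base_out = matvec(base_weight, x)
    lora_out = matvec(lora_b, matvec(lora_a, x))                # trainable low-rank path
    return [b + alpha_over_r * l for b, l in zip(base_out, lora_out)]


base = [[0.5, -1.0, 0.25], [0.75, 0.1, -0.6]]    # frozen 2x3 base weight
codes, scale = quantize_matrix(base)
lora_a = [[0.1, 0.2, 0.3]]                       # rank r = 1, shape r x 3
lora_b_init = [[0.0], [0.0]]                     # B starts at zero
lora_b_trained = [[1.0], [-2.0]]                 # made-up values after training
x = [1.0, 2.0, -1.0]

y_init = qlora_forward(codes, scale, lora_a, lora_b_init, x)
y_trained = qlora_forward(codes, scale, lora_a, lora_b_trained, x)
print(y_init, y_trained)
```

With B initialized to zero the adapter contributes nothing, so the output equals that of the quantized base model. Only `lora_a` and `lora_b` would receive gradient updates, while the integer codes stay fixed.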
