Principle:Huggingface Optimum GPTQ Quantizer Configuration
Overview
Configuration schema for GPTQ post-training quantization that defines bit-width, group size, dampening, and quantization strategy parameters.
Description
GPTQ (Generative Pre-trained Transformer Quantization) reduces model weights to low-bit representations (2-8 bits) using Hessian-based calibration. The quantizer configuration defines the key parameters that control the quantization process:
- Bit-width (`bits`) controls the precision/compression trade-off. Supported values are 2, 3, 4, and 8 bits.
- Group size (`group_size`) determines how many weights share quantization parameters. The default is 128; setting it to -1 enables per-column quantization.
- Dampening percent (`damp_percent`) stabilizes the Hessian inverse computation. The recommended default is 0.1.
- Activation ordering (`desc_act`) can improve quantization quality by quantizing columns in order of decreasing activation magnitude, at the cost of slower inference.
- Symmetric quantization (`sym`) toggles between symmetric and asymmetric quantization modes. Asymmetric quantization requires `gptqmodel`.
- Weight format (`format`) selects between `gptq` (v1) and `gptq_v2` formats. The v2 format is used internally by `gptqmodel` for asymmetric support.
- True sequential (`true_sequential`) enables layer-wise quantization within a single Transformer block, so each layer is quantized using inputs that have passed through previously quantized layers.
The configuration is validated at initialization: bits must be in [2, 3, 4, 8], group_size must be greater than 0 or equal to -1, and damp_percent must be strictly between 0 and 1.
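These validation rules can be expressed as a standalone check. The helper below is a sketch mirroring the constraints stated above, not the actual Optimum initialization code:

```python
def validate_gptq_config(bits: int, group_size: int, damp_percent: float) -> None:
    """Sketch of the GPTQ configuration checks described above (hypothetical helper)."""
    if bits not in (2, 3, 4, 8):
        raise ValueError(f"bits must be one of 2, 3, 4, 8; got {bits}")
    if group_size != -1 and group_size <= 0:
        raise ValueError(f"group_size must be > 0 or exactly -1; got {group_size}")
    if not (0.0 < damp_percent < 1.0):
        raise ValueError(f"damp_percent must be strictly between 0 and 1; got {damp_percent}")
```

A failing value raises immediately, so an invalid configuration never reaches the calibration loop.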
Usage
Use when setting up GPTQ quantization for any large language model to reduce memory footprint. The configuration parameters are passed to the GPTQQuantizer constructor and determine all aspects of the quantization behavior.
```python
from optimum.gptq import GPTQQuantizer

quantizer = GPTQQuantizer(
    bits=4,                 # 4-bit weights
    dataset="wikitext2",    # calibration dataset
    group_size=128,         # weights per quantization group
    damp_percent=0.1,       # Hessian dampening fraction
    desc_act=False,         # no activation-order reordering
    sym=True,               # symmetric quantization
    true_sequential=True,   # layer-wise quantization within each block
    format="gptq",          # v1 weight format
)
```
Theoretical Basis
GPTQ is based on the OBQ (Optimal Brain Quantization) framework. For each weight column w, GPTQ solves:
argmin_q (w - q)^T H (w - q)
where H is the Hessian of the layer loss with respect to weights. The dampening parameter adds λI to H for numerical stability, where λ = damp_percent × mean(diag(H)). Group-wise quantization applies separate scale and zero-point parameters per group_size consecutive weights, allowing finer-grained quantization at the cost of additional storage for the quantization parameters.
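The dampening step can be illustrated numerically. The sketch below (NumPy, not the Optimum implementation) builds the Hessian proxy H = 2XXᵀ from calibration inputs and adds λI with λ = damp_percent × mean(diag(H)):

```python
import numpy as np

def dampened_hessian(X: np.ndarray, damp_percent: float = 0.1) -> np.ndarray:
    """Hessian proxy H = 2 X X^T for the layer's quadratic loss,
    dampened with lambda * I, lambda = damp_percent * mean(diag(H))."""
    H = 2.0 * X @ X.T
    lam = damp_percent * np.mean(np.diag(H))
    return H + lam * np.eye(H.shape[0])

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64))   # 8 input features, 64 calibration samples
Hd = dampened_hessian(X, damp_percent=0.1)
# Dampening makes the matrix positive definite, hence safely invertible
assert np.all(np.linalg.eigvalsh(Hd) > 0)
```

Without the λI term, a rank-deficient calibration set can leave H singular and the Cholesky-based inverse used by GPTQ would fail.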
The desc_act option (also known as act-order) reorders columns by decreasing activation magnitude before quantization. This ensures that the most important weights (those multiplied by the largest activations) are quantized first, when the accumulated error is smallest.
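The act-order permutation can be sketched by sorting columns by the Hessian diagonal, since diag(XXᵀ) is the squared activation energy of each input feature. This is a simplification: the real implementation also un-permutes the weights after quantization.

```python
import numpy as np

def act_order_permutation(X: np.ndarray) -> np.ndarray:
    """Column indices sorted by decreasing activation magnitude.
    diag(X @ X.T)[i] is the squared activation energy of feature i."""
    diag_H = np.sum(X * X, axis=1)   # equivalent to np.diag(X @ X.T)
    return np.argsort(-diag_H)       # largest energy first

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 32))
X[2] *= 10.0                         # make feature 2 dominate
perm = act_order_permutation(X)
assert perm[0] == 2                  # the most active column is quantized first
```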
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `bits` | `int` | (required) | Number of bits for quantization. Must be 2, 3, 4, or 8. |
| `dataset` | `str` or `List[str]` | `None` | Calibration dataset name or list of strings. |
| `group_size` | `int` | `128` | Number of weights sharing quantization parameters; `-1` for per-column. |
| `damp_percent` | `float` | `0.1` | Dampening as a fraction of the average Hessian diagonal. |
| `desc_act` | `bool` | `False` | Quantize columns in decreasing activation order. |
| `act_group_aware` | `bool` | `True` | Use GAR (group-aware activation reordering). Only applies when `desc_act=False`. |
| `sym` | `bool` | `True` | Use symmetric quantization. |
| `true_sequential` | `bool` | `True` | Layer-wise quantization within blocks. |
| `format` | `str` | `"gptq"` | Weight format: `gptq` (v1) or `gptq_v2`. |
| `backend` | `str` | `None` | GPTQ inference kernel backend selection. |
Metadata
| Key | Value |
|---|---|
| source Paper | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |
| source Repo | optimum |
| domains | Quantization, NLP, Optimization |
Related
- implemented_by → Implementation:Huggingface_Optimum_GPTQQuantizer_Init
- Heuristic:Huggingface_Optimum_GPTQ_Quantization_Defaults