
Principle:Huggingface Optimum GPTQ Quantizer Configuration

From Leeroopedia

Overview

Configuration schema for GPTQ post-training quantization that defines bit-width, group size, dampening, and quantization strategy parameters.

Description

GPTQ (Generative Pre-trained Transformer Quantization) reduces model weights to low-bit representations (2-8 bits) using Hessian-based calibration. The quantizer configuration defines the key parameters that control the quantization process:

  • Bit-width (bits) controls the precision/compression trade-off. Supported values are 2, 3, 4, and 8 bits.
  • Group size (group_size) determines how many weights share quantization parameters. The default is 128; setting it to -1 enables per-column quantization.
  • Dampening percent (damp_percent) stabilizes the Hessian inverse computation. The recommended default is 0.1.
  • Activation ordering (desc_act) can improve quantization quality by quantizing columns in order of decreasing activation magnitude, at the cost of slower inference.
  • Symmetric quantization (sym) toggles between symmetric and asymmetric quantization modes. Asymmetric quantization requires gptqmodel.
  • Weight format (format) selects between gptq (v1) and gptq_v2 formats. The v2 format is used internally by gptqmodel for asymmetric support.
  • True sequential (true_sequential) enables layer-wise quantization within a single Transformer block, so each layer is quantized using inputs that have passed through previously quantized layers.
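The group-size behavior described above can be sketched in plain Python. This is an illustrative sketch only, not Optimum's internal code; it shows how `group_size` partitions one weight row into groups that each share a (scale, zero-point) pair under asymmetric min/max quantization:

```python
# Illustrative sketch (not Optimum's internal code): how group_size
# partitions a weight row into groups sharing quantization parameters.

def quant_params_per_group(row, group_size, bits=4):
    """Return one (scale, zero_point) pair per group (asymmetric min/max)."""
    if group_size == -1:
        # -1: the whole column shares one parameter pair (per-column mode)
        group_size = len(row)
    qmax = (1 << bits) - 1  # e.g. 15 for 4-bit
    params = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero = round(-lo / scale)
        params.append((scale, zero))
    return params

row = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -2.0]
print(len(quant_params_per_group(row, group_size=4)))   # 2 groups -> 2 pairs
print(len(quant_params_per_group(row, group_size=-1)))  # whole row -> 1 pair
```

A smaller `group_size` yields more parameter pairs and finer-grained quantization, at the cost of extra storage for those parameters.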

The configuration is validated at initialization: bits must be in [2, 3, 4, 8], group_size must be greater than 0 or equal to -1, and damp_percent must be strictly between 0 and 1.
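The validation rules above can be sketched as follows. This is an assumed reconstruction of the described behavior, not the actual checks in `GPTQQuantizer.__init__`:

```python
# Sketch of the validation rules described above (assumed behavior;
# the real checks live inside optimum's GPTQQuantizer constructor).

def validate_gptq_config(bits, group_size, damp_percent):
    if bits not in (2, 3, 4, 8):
        raise ValueError(f"bits must be one of 2, 3, 4, 8, got {bits}")
    if group_size != -1 and group_size <= 0:
        raise ValueError(f"group_size must be > 0 or -1, got {group_size}")
    if not 0 < damp_percent < 1:
        raise ValueError(f"damp_percent must be strictly in (0, 1), got {damp_percent}")

validate_gptq_config(4, 128, 0.1)      # passes silently
try:
    validate_gptq_config(5, 128, 0.1)  # 5-bit is unsupported
except ValueError as e:
    print(e)
```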

Usage

Use this configuration when applying GPTQ quantization to a large language model to reduce its memory footprint. The parameters are passed to the GPTQQuantizer constructor and determine all aspects of the quantization behavior.

from optimum.gptq import GPTQQuantizer

quantizer = GPTQQuantizer(
    bits=4,                # 4-bit weights
    dataset="wikitext2",   # calibration dataset
    group_size=128,        # weights per quantization group
    damp_percent=0.1,      # Hessian dampening fraction
    desc_act=False,        # no activation-order reordering
    sym=True,              # symmetric quantization
    true_sequential=True,  # layer-wise quantization within blocks
    format="gptq",         # v1 weight format
)

# The configured quantizer is then applied with
# quantizer.quantize_model(model, tokenizer).

Theoretical Basis

GPTQ is based on the OBQ (Optimal Brain Quantization) framework. For each weight column w, GPTQ solves:

argmin_q (w - q)^T H (w - q)

where H is the Hessian of the layer loss with respect to weights. The dampening parameter adds λI to H for numerical stability, where λ = damp_percent × mean(diag(H)). Group-wise quantization applies separate scale and zero-point parameters per group_size consecutive weights, allowing finer-grained quantization at the cost of additional storage for the quantization parameters.
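The dampening step can be worked through numerically. The sketch below uses a tiny hypothetical 2×2 Hessian in pure Python (illustrative only):

```python
# Numeric sketch of the dampening step described above:
# lambda = damp_percent * mean(diag(H)), then H <- H + lambda * I.

def damp_hessian(H, damp_percent=0.1):
    n = len(H)
    lam = damp_percent * sum(H[i][i] for i in range(n)) / n
    return [[H[i][j] + (lam if i == j else 0.0) for j in range(n)]
            for i in range(n)]

H = [[4.0, 1.0],
     [1.0, 2.0]]           # mean diagonal = 3.0 -> lambda = 0.3
print(damp_hessian(H))     # diagonal becomes ~4.3 and ~2.3; off-diagonals unchanged
```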

The desc_act option (also known as act-order) reorders columns by decreasing activation magnitude before quantization. This ensures that the most important weights (those multiplied by the largest activations) are quantized first, when the accumulated error is smallest.
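The reordering idea behind desc_act can be sketched in a few lines. The activation magnitudes below are hypothetical values for illustration:

```python
# Sketch of the act-order idea: sort column indices by decreasing
# activation magnitude so the most influential columns quantize first.

def act_order(activation_norms):
    return sorted(range(len(activation_norms)),
                  key=lambda i: activation_norms[i], reverse=True)

norms = [0.2, 3.1, 1.5, 0.9]  # hypothetical per-column activation magnitudes
print(act_order(norms))        # [1, 2, 3, 0]
```

The permuted order must be stored alongside the quantized weights so inference can undo it, which is the source of the slowdown mentioned above.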

Configuration Parameters

Parameter       | Type             | Default    | Description
bits            | int              | (required) | Number of bits for quantization. Must be 2, 3, 4, or 8.
dataset         | str or List[str] | None       | Calibration dataset name or list of strings.
group_size      | int              | 128        | Number of weights sharing quantization parameters; -1 for per-column.
damp_percent    | float            | 0.1        | Dampening as a fraction of the average Hessian diagonal.
desc_act        | bool             | False      | Quantize columns in decreasing activation order.
act_group_aware | bool             | True       | Use GAR (group-aware activation ordering); only applies when desc_act=False.
sym             | bool             | True       | Use symmetric quantization.
true_sequential | bool             | True       | Layer-wise quantization within Transformer blocks.
format          | str              | "gptq"     | Weight format: gptq (v1) or gptq_v2.
backend         | str              | None       | GPTQ inference kernel backend selection.

Metadata

Key          | Value
source Paper | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
source Repo  | optimum
domains      | Quantization, NLP, Optimization
