Principle:Huggingface Optimum Sequential Block Quantization
Overview
Algorithm for quantizing transformer blocks one at a time, propagating updated activations through the network to minimize cumulative error.
Description
Rather than quantizing all layers independently, GPTQ processes transformer blocks sequentially. For each block, the algorithm:
- Creates GPTQ solver instances for each linear layer in the block (or a subset if `modules_in_block_to_quantize` is specified).
- Registers forward hooks on the linear layers to accumulate Hessian statistics via `add_batch()` as calibration data flows through.
- Runs calibration data through the block, building the Hessian matrix `H = 2 * X^T * X` for each layer.
- Calls `fasterquant()` to solve for optimal quantized weights, producing scale, zero-point, and activation-order index (`g_idx`) parameters alongside the quantized values.
- Updates the block's weights with the quantized values and captures the block's output as input for the next block.
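The Hessian accumulation step can be sketched as follows. This is a minimal NumPy sketch, not Optimum's actual code: `add_batch` here is a standalone function rather than a method on a solver object, and it keeps a running average so the Hessian's scale does not depend on how many calibration batches were seen.

```python
import numpy as np

def add_batch(H, n_samples, X):
    """Fold one calibration batch into the running Hessian H = 2 * X^T * X.

    H: current (d, d) Hessian estimate, n_samples: samples seen so far,
    X: (batch, d) inputs captured by the layer's forward hook.
    Returns the updated Hessian and sample count.
    """
    batch = X.shape[0]
    total = n_samples + batch
    # Rescale the old estimate, then fold in the new batch, so that after
    # all batches H equals 2/total * X_all^T @ X_all.
    H = H * (n_samples / total)
    H = H + (2.0 / total) * X.T @ X
    return H, total
```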
When `true_sequential=True` (the default), layers within a block are quantized one at a time, so each layer sees inputs that have already passed through the previously quantized layers. When `true_sequential=False`, all layers in the block share the same Hessian computation pass.
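The outer sequential loop can be sketched as follows. This is a toy NumPy sketch under stated assumptions: `LinearBlock` is a hypothetical one-layer stand-in for a transformer block, and `quantize_weight` stands in for the GPTQ solver; the point is only the propagation pattern, where each block's quantized outputs become the next block's calibration inputs.

```python
import numpy as np

class LinearBlock:
    """Toy stand-in for a transformer block: a single linear layer."""
    def __init__(self, W):
        self.W = W  # (d_in, d_out) weight matrix

    def forward(self, x):
        return x @ self.W

def quantize_blocks_sequentially(blocks, calib_inputs, quantize_weight):
    """Quantize blocks one at a time, propagating updated activations.

    After each block is quantized, the calibration inputs are re-run
    through the *quantized* block, so later blocks calibrate against
    activations that already carry the earlier quantization error.
    """
    inputs = calib_inputs
    for block in blocks:
        block.W = quantize_weight(block.W, inputs)     # solve for quantized weights
        inputs = [block.forward(x) for x in inputs]    # capture outputs for next block
    return inputs
```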
Usage
This is the core quantization loop, applied after model conversion and input capture. It is invoked automatically by GPTQQuantizer.quantize_model().
Theoretical Basis
The key GPTQ equation per column is:
q* = argmin_q (w - q)^T H_F (w - q)
This is solved greedily column-by-column. The fasterquant algorithm processes columns in groups, using Cholesky decomposition of the Hessian for efficiency. For a group of columns, the algorithm:
- Computes the Cholesky factorization of the relevant Hessian block.
- Quantizes each column by rounding and computing the quantization error.
- Distributes the error across remaining columns using the Hessian information (error compensation).
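The column loop above can be sketched as follows. This is a simplified NumPy sketch, not Optimum's `fasterquant()`: it quantizes one column at a time rather than in groups, uses plain rounding in place of a real quantization grid, and `fasterquant_sketch` is a hypothetical name.

```python
import numpy as np

def fasterquant_sketch(W, H, quantize=np.round):
    """Greedy column-by-column quantization with error compensation.

    W: (rows, cols) weight matrix; H: (cols, cols) damped Hessian.
    Uses the upper Cholesky factor U of the inverse Hessian,
    H^{-1} = U^T U, as in the fasterquant algorithm.
    """
    W = W.astype(float).copy()
    Q = np.zeros_like(W)
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T  # upper factor U
    for i in range(W.shape[1]):
        q = quantize(W[:, i])                 # quantize one column
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]      # scaled quantization error
        # Error compensation: distribute the error across the columns
        # that have not been quantized yet.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

With an identity Hessian the compensation terms vanish and the sketch reduces to plain rounding, which is a useful sanity check.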
Sequential block processing ensures that quantization error in earlier blocks is accounted for when quantizing later blocks. After each block is quantized, the updated block outputs are used as inputs for the next block. This propagation of updated activations through the network reduces the cumulative quantization error compared to independent block quantization.
The percdamp parameter adds dampening to the Hessian diagonal:
H_damped = H + λI, where λ = percdamp × mean(diag(H))
This ensures numerical stability during the Cholesky decomposition, particularly for ill-conditioned Hessians.
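The dampening formula above translates directly to code. A minimal NumPy sketch (`damp_hessian` is a hypothetical helper name, not Optimum's API):

```python
import numpy as np

def damp_hessian(H, percdamp=0.01):
    """Add percdamp * mean(diag(H)) to the diagonal of H.

    Makes an ill-conditioned (or singular) Hessian safely positive
    definite before the Cholesky decomposition.
    """
    lam = percdamp * np.mean(np.diag(H))
    return H + lam * np.eye(H.shape[0])
```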
Metadata
| Key | Value |
|---|---|
| Source paper | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |
Related
- implemented_by → Implementation:Huggingface_Optimum_GPTQ_Fasterquant