Principle:Huggingface Optimum Sequential Block Quantization
Overview
Algorithm for quantizing transformer blocks one at a time, propagating updated activations through the network to minimize cumulative error.
Description
Rather than quantizing all layers independently, GPTQ processes transformer blocks sequentially. For each block, the algorithm:
- Creates GPTQ solver instances for each linear layer in the block (or a subset if `modules_in_block_to_quantize` is specified).
- Registers forward hooks on the linear layers to accumulate Hessian statistics via `add_batch()` as calibration data flows through.
- Runs calibration data through the block, building the Hessian matrix `H = 2 * X^T * X` for each layer.
- Calls `fasterquant()` to solve for optimal quantized weights, producing scale, zero-point, and activation-order index (`g_idx`) parameters alongside the quantized values.
- Updates the block's weights with the quantized values and captures the block's output as input for the next block.
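The Hessian accumulation step can be sketched as follows. This is a minimal NumPy sketch, not Optimum's actual code: `add_batch` here is a standalone function rather than a method on a solver object, and it keeps a running average so the Hessian's scale does not depend on how many calibration batches were seen.

```python
import numpy as np

def add_batch(H, n_samples, X):
    """Fold one calibration batch into the running Hessian H = 2 * X^T * X.

    H: current (d, d) Hessian estimate, n_samples: samples seen so far,
    X: (batch, d) inputs captured by the layer's forward hook.
    Returns the updated Hessian and sample count.
    """
    batch = X.shape[0]
    total = n_samples + batch
    # Rescale the old estimate, then fold in the new batch, so that after
    # all batches H equals 2/total * X_all^T @ X_all.
    H = H * (n_samples / total)
    H = H + (2.0 / total) * X.T @ X
    return H, total
```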
When `true_sequential=True` (the default), layers within a block are quantized one at a time, so each layer sees inputs that have already passed through the previously quantized layers. When `true_sequential=False`, all layers in the block share the same Hessian computation pass.
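The outer sequential loop can be sketched as follows. This is a toy NumPy sketch under stated assumptions: `LinearBlock` is a hypothetical one-layer stand-in for a transformer block, and `quantize_weight` stands in for the GPTQ solver; the point is only the propagation pattern, where each block's quantized outputs become the next block's calibration inputs.

```python
import numpy as np

class LinearBlock:
    """Toy stand-in for a transformer block: a single linear layer."""
    def __init__(self, W):
        self.W = W  # (d_in, d_out) weight matrix

    def forward(self, x):
        return x @ self.W

def quantize_blocks_sequentially(blocks, calib_inputs, quantize_weight):
    """Quantize blocks one at a time, propagating updated activations.

    After each block is quantized, the calibration inputs are re-run
    through the *quantized* block, so later blocks calibrate against
    activations that already carry the earlier quantization error.
    """
    inputs = calib_inputs
    for block in blocks:
        block.W = quantize_weight(block.W, inputs)     # solve for quantized weights
        inputs = [block.forward(x) for x in inputs]    # capture outputs for next block
    return inputs
```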
Usage
This is the core quantization loop, applied after model conversion and input capture. It is invoked automatically by GPTQQuantizer.quantize_model().
Theoretical Basis
The key GPTQ equation per column is:
q* = argmin_q (w - q)^T H_F (w - q)
This is solved greedily column-by-column. The fasterquant algorithm processes columns in groups, using Cholesky decomposition of the Hessian for efficiency. For a group of columns, the algorithm:
- Computes the Cholesky factorization of the relevant Hessian block.
- Quantizes each column by rounding and computing the quantization error.
- Distributes the error across remaining columns using the Hessian information (error compensation).
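The column loop above can be sketched as follows. This is a simplified NumPy sketch, not Optimum's `fasterquant()`: it quantizes one column at a time rather than in groups, uses plain rounding in place of a real quantization grid, and `fasterquant_sketch` is a hypothetical name.

```python
import numpy as np

def fasterquant_sketch(W, H, quantize=np.round):
    """Greedy column-by-column quantization with error compensation.

    W: (rows, cols) weight matrix; H: (cols, cols) damped Hessian.
    Uses the upper Cholesky factor U of the inverse Hessian,
    H^{-1} = U^T U, as in the fasterquant algorithm.
    """
    W = W.astype(float).copy()
    Q = np.zeros_like(W)
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T  # upper factor U
    for i in range(W.shape[1]):
        q = quantize(W[:, i])                 # quantize one column
        Q[:, i] = q
        err = (W[:, i] - q) / Hinv[i, i]      # scaled quantization error
        # Error compensation: distribute the error across the columns
        # that have not been quantized yet.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

With an identity Hessian the compensation terms vanish and the sketch reduces to plain rounding, which is a useful sanity check.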
Sequential block processing ensures that quantization error in earlier blocks is accounted for when quantizing later blocks. After each block is quantized, the updated block outputs are used as inputs for the next block. This propagation of updated activations through the network reduces the cumulative quantization error compared to independent block quantization.
The percdamp parameter adds dampening to the Hessian diagonal:
H_damped = H + λI, where λ = percdamp × mean(diag(H))
This ensures numerical stability during the Cholesky decomposition, particularly for ill-conditioned Hessians.
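The dampening formula above translates directly to code. A minimal NumPy sketch (`damp_hessian` is a hypothetical helper name, not Optimum's API):

```python
import numpy as np

def damp_hessian(H, percdamp=0.01):
    """Add percdamp * mean(diag(H)) to the diagonal of H.

    Makes an ill-conditioned (or singular) Hessian safely positive
    definite before the Cholesky decomposition.
    """
    lam = percdamp * np.mean(np.diag(H))
    return H + lam * np.eye(H.shape[0])
```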
Metadata
| Key | Value |
|---|---|
| Source paper | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |
Related
- implemented_by → Implementation:Huggingface_Optimum_GPTQ_Fasterquant