Principle: turboderp-org/exllamav2 Layer Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Compression, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Layer quantization via GPTQ compresses neural network weights from high-precision (FP16) to low-precision (2-8 bit) representations. Weights are processed column by column, using the inverse Hessian to optimally round each weight and propagate the rounding error to the columns not yet processed.
Description
GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot, post-training quantization method that achieves near-lossless compression of large language models. Unlike naive round-to-nearest (RTN) quantization, GPTQ accounts for the correlations between weights by using second-order (Hessian) information derived from the calibration data.
The core idea is to process weight columns sequentially: when a column is quantized (rounded to a discrete grid), the resulting error is propagated to all remaining unprocessed columns using a correction factor derived from the inverse Hessian. This ensures that the overall output perturbation is minimized, not just the per-element rounding error.
ExLlamaV2 extends standard GPTQ with an Adaptive GPTQ variant. In this approach, the optimal quantization parameters for each linear layer have already been determined by the bit allocation optimization step. During the actual quantization, the layer is quantized using those parameters, then the quantized weights are packed into the EXL2 format, and a reconstruction sanity check verifies that the packed representation matches the quantized weights.
Usage
Layer quantization is the fourth step in the EXL2 pipeline, applied after the bit allocation strategy has been determined. Each layer is loaded, quantized according to its assigned strategy, and saved to disk before advancing to the next layer.
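The sequential load-quantize-save loop can be sketched as follows. The helper names (`quantize_layer`, the strategy objects, the on-disk format) are illustrative placeholders, not the actual exllamav2 API; the real pipeline lives in the project's conversion scripts and packs into EXL2 rather than raw bytes.

```python
from pathlib import Path

def quantize_model(layers, strategies, quantize_layer, save_dir):
    """Sketch of the per-layer quantization loop (illustrative names only).

    Each layer is quantized with the strategy chosen during bit allocation
    and written to disk before the next layer is touched, so at most one
    layer is resident at a time.
    """
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (layer, params) in enumerate(zip(layers, strategies)):
        qlayer = quantize_layer(layer, params)   # apply assigned bit/group strategy
        path = out / f"layer_{i:04d}.bin"
        path.write_bytes(qlayer)                 # placeholder for EXL2 packing
        paths.append(path)
        del layer, qlayer                        # free before loading the next layer
    return paths
```

The design point is memory: because each layer is finalized and saved before the next is loaded, peak memory stays near a single layer's footprint rather than the whole model's.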
Theoretical Basis
GPTQ Algorithm
Given a weight matrix W of shape (d_out, d_in) and the Hessian H = X^T X from calibration data:
```text
1. Compute H_inv = (H + lambda * I)^{-1}            # Cholesky-based inversion with dampening
2. For each column j = 0, 1, ..., d_in - 1:
   a. q_j = quantize(W[:, j])                       # round column j to the quantization grid
   b. error_j = (W[:, j] - q_j) / H_inv[j, j]       # normalized error (vector over rows)
   c. W[:, j+1:] -= outer(error_j, H_inv[j, j+1:])  # propagate error to remaining columns
   d. W[:, j] = q_j                                 # commit quantized column
```
The key insight is that step (c) adjusts the remaining weights to compensate for the rounding error in column j, weighted by how much those weights co-vary with column j (as captured by the Hessian).
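The column loop above can be sketched in NumPy. This is a simplified illustration, not ExLlamaV2's implementation: it uses a plain symmetric per-column grid, direct matrix inversion instead of the Cholesky formulation, and no group structure.

```python
import numpy as np

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Simplified GPTQ sketch: quantize W (d_out, d_in) given Hessian H (d_in, d_in).

    Each column is rounded to a symmetric grid, and the rounding error is
    propagated to the remaining columns via the dampened inverse Hessian.
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    lam = damp * np.mean(np.diag(H))            # dampening keeps H invertible
    H_inv = np.linalg.inv(H + lam * np.eye(d_in))

    qmax = 2 ** (bits - 1) - 1
    for j in range(d_in):
        col = W[:, j]
        scale = max(np.max(np.abs(col)) / qmax, 1e-12)
        q = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        err = (col - q) / H_inv[j, j]           # normalized rounding error
        W[:, j] = q                             # commit quantized column
        if j + 1 < d_in:
            # Adjust unprocessed columns in proportion to their covariance with column j
            W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return W
```

Note that the returned matrix holds dequantized values; a real pipeline would store the integer codes and scales instead.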
Adaptive GPTQ Extension
ExLlamaV2's Adaptive GPTQ adds the following capabilities:
- Mixed-precision quantization: Within a single matrix, different groups of weights can use different bit widths (e.g., 65% of groups at 4-bit, 35% at 3-bit).
- Configurable group sizes: Quantization scale factors are computed per group of rows (e.g., groups of 32, 64, 128, or 256 rows).
- Scale quantization: The scale factors themselves can be quantized to 4 or 6 bits for additional compression.
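A minimal sketch of the grouped, mixed-precision idea follows. It is deliberately simpler than EXL2: one scalar scale per row-group (real schemes typically keep finer-grained scales within a group), and the scales themselves are left in full precision rather than quantized.

```python
import numpy as np

def quantize_grouped(W, group_bits, group_size=32):
    """Mixed-precision grouped quantization sketch (simplified vs. EXL2).

    W          : weight matrix of shape (rows, cols)
    group_bits : one bit width per consecutive row-group, e.g. [4, 4, 3]
    Returns the dequantized weights and the per-group scale factors.
    """
    rows = W.shape[0]
    Wq = np.empty_like(W, dtype=np.float64)
    scales = []
    for g, bits in enumerate(group_bits):
        r0, r1 = g * group_size, min((g + 1) * group_size, rows)
        block = W[r0:r1]
        qmax = 2 ** (bits - 1) - 1
        scale = max(np.max(np.abs(block)) / qmax, 1e-12)  # per-group scale factor
        Wq[r0:r1] = np.clip(np.round(block / scale), -qmax - 1, qmax) * scale
        scales.append(scale)
    return Wq, np.array(scales)
```

The per-group rounding error is bounded by half the group's scale, so assigning more bits to a group shrinks its scale and tightens that bound; this is the lever the bit allocation step exploits.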
Reconstruction Verification
After quantization and packing, the implementation performs two sanity checks:
- Unpack check: Dequantize the packed weights and compare to the quantized matrix. Maximum allowed difference: 0.05.
- Forward check: Pass an identity matrix through the reconstructed linear layer and compare to the quantized weights. Maximum allowed difference: 0.075.
These checks catch packing bugs, CUDA kernel errors, and numerical instabilities.
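The two checks can be sketched as below, with the thresholds taken from the text. The `dequantize` and `forward` callables stand in for the packed-format unpacking path and the CUDA kernel respectively, and the sketch assumes the convention `y = x @ W` so that an identity input reproduces the weight matrix.

```python
import numpy as np

def verify_reconstruction(W_q, dequantize, forward,
                          atol_unpack=0.05, atol_fwd=0.075):
    """Sanity-check sketch: packed weights must round-trip to W_q.

    W_q        : quantized (dequantized-value) weight matrix, shape (d_in, d_out)
    dequantize : callable returning the unpacked weight matrix
    forward    : callable computing x @ W through the reconstructed layer
    """
    # Unpack check: dequantize the packed weights and compare to W_q.
    W_unpacked = dequantize()
    if np.max(np.abs(W_unpacked - W_q)) > atol_unpack:
        raise ValueError("unpack check failed")
    # Forward check: an identity input through the layer should return W_q.
    I = np.eye(W_q.shape[0])
    Y = forward(I)
    if np.max(np.abs(Y - W_q)) > atol_fwd:
        raise ValueError("forward check failed")
    return True
```

The forward tolerance is looser than the unpack tolerance because the kernel path accumulates additional floating-point error on top of the packing round-trip.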