Principle: turboderp-org/exllamav2 Layer Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Compression, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Layer quantization via GPTQ compresses neural network weights from high-precision (FP16) to low-precision (2-8 bit) representations. Weights are processed column by column, using the inverse Hessian to optimally round each weight and propagate the rounding error to the columns not yet processed.
Description
GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot, post-training quantization method that achieves near-lossless compression of large language models. Unlike naive round-to-nearest (RTN) quantization, GPTQ accounts for the correlations between weights by using second-order (Hessian) information derived from the calibration data.
The core idea is to process weight columns sequentially: when a column is quantized (rounded to a discrete grid), the resulting error is propagated to all remaining unprocessed columns using a correction factor derived from the inverse Hessian. This ensures that the overall output perturbation is minimized, not just the per-element rounding error.
ExLlamaV2 extends standard GPTQ with an Adaptive GPTQ variant. In this approach, the optimal quantization parameters for each linear layer have already been determined by the bit allocation optimization step. During the actual quantization, the layer is quantized using those parameters, then the quantized weights are packed into the EXL2 format, and a reconstruction sanity check verifies that the packed representation matches the quantized weights.
Usage
Layer quantization is the fourth step in the EXL2 pipeline, applied after the bit allocation strategy has been determined. Each layer is loaded, quantized according to its assigned strategy, and saved to disk before advancing to the next layer.
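The sequential load-quantize-save loop can be sketched as follows. The helper names (`quantize_layer`, the strategy objects, the on-disk format) are illustrative placeholders, not the actual exllamav2 API; the real pipeline lives in the project's conversion scripts and packs into EXL2 rather than raw bytes.

```python
from pathlib import Path

def quantize_model(layers, strategies, quantize_layer, save_dir):
    """Sketch of the per-layer quantization loop (illustrative names only).

    Each layer is quantized with the strategy chosen during bit allocation
    and written to disk before the next layer is touched, so at most one
    layer is resident at a time.
    """
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (layer, params) in enumerate(zip(layers, strategies)):
        qlayer = quantize_layer(layer, params)   # apply assigned bit/group strategy
        path = out / f"layer_{i:04d}.bin"
        path.write_bytes(qlayer)                 # placeholder for EXL2 packing
        paths.append(path)
        del layer, qlayer                        # free before loading the next layer
    return paths
```

The design point is memory: because each layer is finalized and saved before the next is loaded, peak memory stays near a single layer's footprint rather than the whole model's.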
Theoretical Basis
GPTQ Algorithm
Given a weight matrix W of shape (d_out, d_in) and the Hessian H = X^T X from calibration data:
```text
1. Compute H_inv = (H + lambda * I)^{-1}            # Cholesky-based inversion with dampening
2. For each column j = 0, 1, ..., d_in - 1:
   a. q_j = quantize(W[:, j])                       # round column j to the quantization grid
   b. error_j = (W[:, j] - q_j) / H_inv[j, j]       # normalized error (vector over rows)
   c. W[:, j+1:] -= outer(error_j, H_inv[j, j+1:])  # propagate error to remaining columns
   d. W[:, j] = q_j                                 # commit quantized column
```
The key insight is that step (c) adjusts the remaining weights to compensate for the rounding error in column j, weighted by how much those weights co-vary with column j (as captured by the Hessian).
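The column loop above can be sketched in NumPy. This is a simplified illustration, not ExLlamaV2's implementation: it uses a plain symmetric per-column grid, direct matrix inversion instead of the Cholesky formulation, and no group structure.

```python
import numpy as np

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Simplified GPTQ sketch: quantize W (d_out, d_in) given Hessian H (d_in, d_in).

    Each column is rounded to a symmetric grid, and the rounding error is
    propagated to the remaining columns via the dampened inverse Hessian.
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    lam = damp * np.mean(np.diag(H))            # dampening keeps H invertible
    H_inv = np.linalg.inv(H + lam * np.eye(d_in))

    qmax = 2 ** (bits - 1) - 1
    for j in range(d_in):
        col = W[:, j]
        scale = max(np.max(np.abs(col)) / qmax, 1e-12)
        q = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        err = (col - q) / H_inv[j, j]           # normalized rounding error
        W[:, j] = q                             # commit quantized column
        if j + 1 < d_in:
            # Adjust unprocessed columns in proportion to their covariance with column j
            W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return W
```

Note that the returned matrix holds dequantized values; a real pipeline would store the integer codes and scales instead.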
Adaptive GPTQ Extension
ExLlamaV2's Adaptive GPTQ adds the following capabilities:
- Mixed-precision quantization: Within a single matrix, different groups of weights can use different bit widths (e.g., 65% of groups at 4-bit, 35% at 3-bit).
- Configurable group sizes: Quantization scale factors are computed per group of rows (e.g., groups of 32, 64, 128, or 256 rows).
- Scale quantization: The scale factors themselves can be quantized to 4 or 6 bits for additional compression.
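A minimal sketch of the grouped, mixed-precision idea follows. It is deliberately simpler than EXL2: one scalar scale per row-group (real schemes typically keep finer-grained scales within a group), and the scales themselves are left in full precision rather than quantized.

```python
import numpy as np

def quantize_grouped(W, group_bits, group_size=32):
    """Mixed-precision grouped quantization sketch (simplified vs. EXL2).

    W          : weight matrix of shape (rows, cols)
    group_bits : one bit width per consecutive row-group, e.g. [4, 4, 3]
    Returns the dequantized weights and the per-group scale factors.
    """
    rows = W.shape[0]
    Wq = np.empty_like(W, dtype=np.float64)
    scales = []
    for g, bits in enumerate(group_bits):
        r0, r1 = g * group_size, min((g + 1) * group_size, rows)
        block = W[r0:r1]
        qmax = 2 ** (bits - 1) - 1
        scale = max(np.max(np.abs(block)) / qmax, 1e-12)  # per-group scale factor
        Wq[r0:r1] = np.clip(np.round(block / scale), -qmax - 1, qmax) * scale
        scales.append(scale)
    return Wq, np.array(scales)
```

The per-group rounding error is bounded by half the group's scale, so assigning more bits to a group shrinks its scale and tightens that bound; this is the lever the bit allocation step exploits.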
Reconstruction Verification
After quantization and packing, the implementation performs two sanity checks:
- Unpack check: Dequantize the packed weights and compare to the quantized matrix. Maximum allowed difference: 0.05.
- Forward check: Pass an identity matrix through the reconstructed linear layer and compare to the quantized weights. Maximum allowed difference: 0.075.
These checks catch packing bugs, CUDA kernel errors, and numerical instabilities.
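The two checks can be sketched as below, with the thresholds taken from the text. The `dequantize` and `forward` callables stand in for the packed-format unpacking path and the CUDA kernel respectively, and the sketch assumes the convention `y = x @ W` so that an identity input reproduces the weight matrix.

```python
import numpy as np

def verify_reconstruction(W_q, dequantize, forward,
                          atol_unpack=0.05, atol_fwd=0.075):
    """Sanity-check sketch: packed weights must round-trip to W_q.

    W_q        : quantized (dequantized-value) weight matrix, shape (d_in, d_out)
    dequantize : callable returning the unpacked weight matrix
    forward    : callable computing x @ W through the reconstructed layer
    """
    # Unpack check: dequantize the packed weights and compare to W_q.
    W_unpacked = dequantize()
    if np.max(np.abs(W_unpacked - W_q)) > atol_unpack:
        raise ValueError("unpack check failed")
    # Forward check: an identity input through the layer should return W_q.
    I = np.eye(W_q.shape[0])
    Y = forward(I)
    if np.max(np.abs(Y - W_q)) > atol_fwd:
        raise ValueError("forward check failed")
    return True
```

The forward tolerance is looser than the unpack tolerance because the kernel path accumulates additional floating-point error on top of the packing round-trip.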