Principle: turboderp-org ExLlamaV2 Quantization Sensitivity Measurement
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Compression, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Quantization sensitivity measurement is the process of profiling each layer in a neural network to determine how much reconstruction error it incurs under different quantization configurations, producing a per-layer error map that guides optimal bit allocation.
Description
Not all layers in a transformer model are equally sensitive to quantization. Some layers (e.g., early attention projections, certain MLP down-projections) produce large output perturbations when quantized aggressively, while others tolerate aggressive compression with negligible quality loss. Sensitivity measurement answers the question: for each layer, what is the accuracy trade-off at each available quantization setting?
The process works as follows:
- Compute embeddings: Run the tokenized calibration data through the embedding layer to produce initial hidden states.
- Layer-by-layer forward pass: For each subsequent layer, pass the hidden states through the FP16 layer to get reference outputs, then quantize the layer's weights under every candidate configuration (varying bit width, group size, and scale bits) and measure the relative Frobenius norm error between FP16 and quantized outputs.
- Record error profiles: Store each layer's accuracy at each configuration as a list of (total_bits, accuracy) tuples.
The resulting measurement data forms a cost-accuracy frontier for every layer, which is the input to the subsequent bit allocation optimization step.
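The measurement loop described above can be sketched as follows. This is a minimal NumPy illustration, not the actual ExLlamaV2 code: `forward_fp16` and `forward_quantized` are hypothetical stand-ins for the layer's FP16 and quantized forward passes, and the qparams dicts are simplified placeholders.

```python
import numpy as np

def measure_layer_sensitivity(forward_fp16, forward_quantized,
                              hidden_states, candidate_qparams):
    """Profile one layer's accuracy under each candidate quantization config.

    forward_fp16(x)          -> FP16 reference output, shape (rows, dim)
    forward_quantized(x, qp) -> output after quantizing weights with config qp
    Returns a list of (total_bits, accuracy) tuples, one per configuration.
    """
    ref = forward_fp16(hidden_states)
    profile = []
    for qp in candidate_qparams:
        out = forward_quantized(hidden_states, qp)
        # Mean relative Frobenius-norm error over calibration rows
        rel_err = np.linalg.norm(out - ref, axis=1) / np.linalg.norm(ref, axis=1)
        accuracy = 1.0 - rel_err.mean()
        profile.append((qp["total_bits"], float(accuracy)))
    return profile
```

In the real pipeline the quantized forward pass is produced by GPTQ-quantizing the layer's weights in place for each trial; the profile returned here is exactly the per-layer cost-accuracy frontier consumed by the bit allocation step.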
Usage
Sensitivity measurement is the second step in the EXL2 conversion pipeline, executed after calibration tokenization and before bit allocation optimization. It is typically the most time-consuming step, as it requires loading each layer onto the GPU and running multiple quantization trials.
Theoretical Basis
Reconstruction Error Metric
For each layer l with FP16 output Y and quantized output Y_hat, the accuracy is defined as:
accuracy_l = 1 - (1/N) * sum_i( ||Y_hat_i - Y_i||_F / ||Y_i||_F )
where i ranges over calibration rows and ||.||_F is the Frobenius norm. The metric equals 1.0 for perfect reconstruction and falls toward 0 as the relative error grows. A minimum accuracy threshold of 0.1 is enforced; values below it indicate a measurement or inference error and cause the conversion process to abort.
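A tiny numeric check of the metric, with two calibration rows whose relative error is constructed to be exactly 0.1 each:

```python
import numpy as np

# FP16 reference rows, with Frobenius (row) norms 5 and 10
Y = np.array([[3.0, 4.0],
              [6.0, 8.0]])
# Quantized outputs with a known perturbation: row error norms 0.5 and 1.0
Y_hat = Y + np.array([[0.0, 0.5],
                      [1.0, 0.0]])
# Per-row relative error: [0.5/5, 1.0/10] = [0.1, 0.1]
rel = np.linalg.norm(Y_hat - Y, axis=1) / np.linalg.norm(Y, axis=1)
accuracy = 1.0 - rel.mean()  # 1.0 - 0.1 = 0.9
```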
Hessian Computation
For each linear sub-layer, an AdaptiveGPTQ quantizer accumulates the Hessian matrix from calibration inputs:
H += X_batch^T * X_batch
The Hessian captures the second-order structure of the layer's input distribution, allowing GPTQ to make optimal rounding decisions. Sub-layers that share the same input (e.g., Q/K/V projections all receive the post-norm hidden state) can reuse the same Hessian to save computation.
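The accumulation step can be sketched as below. This is a simplified illustration of what an AdaptiveGPTQ-style quantizer does with its calibration inputs; the real implementation also applies damping/normalization to H before solving, which is omitted here.

```python
import numpy as np

def accumulate_hessian(batches, dim):
    """Accumulate H += X^T X over calibration input batches.

    batches: iterable of (rows, dim) input matrices seen by the linear layer.
    Returns the (dim, dim) Hessian approximation of the input distribution.
    """
    H = np.zeros((dim, dim))
    for X in batches:
        H += X.T @ X
    return H

# Because Q/K/V projections all consume the same post-norm hidden state,
# a single accumulated H can be shared by all three quantizers.
```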
Quantization Configurations Tested
Each layer is tested under a reduced set of quantization parameter combinations from qparams_attn and qparams_mlp. These include:
- Bit widths: 2, 3, 4, 5, 6, 8 bits
- Group sizes: 32, 64, 128, 256
- Scale bits: 4, 6
- Mixed-precision proportions: e.g., 65% 4-bit + 35% 3-bit
The get_qparams_reduced function generates only the Pareto-relevant combinations to avoid combinatorial explosion across the sub-layers of each module.
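The idea of keeping only Pareto-relevant combinations can be sketched as below. This is a hypothetical simplification, not the actual `get_qparams_reduced`: it models per-weight cost as weight bits plus amortized scale bits, and drops any combination that another combination dominates (at least as many weight bits at strictly lower total cost).

```python
from itertools import product

def reduced_qparams(bit_widths, group_sizes, scale_bits):
    """Enumerate (bits, group_size, scale_bits) combos and keep only the
    Pareto-relevant ones under a simplified per-weight cost model."""
    combos = []
    for b, g, s in product(bit_widths, group_sizes, scale_bits):
        total = b + s / g  # weight bits + scale bits amortized over the group
        combos.append((b, g, s, total))
    keep = []
    for c in combos:
        # c is dominated if some combo has >= weight bits at strictly lower cost
        dominated = any(o[0] >= c[0] and o[3] < c[3] for o in combos)
        if not dominated:
            keep.append(c)
    return keep
```

Under this model, only the cheapest variant of each bit width survives, which is the kind of pruning that keeps the per-module search over sub-layer combinations tractable.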