Principle: turboderp-org ExLlamaV2 Quantization Sensitivity Measurement
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Compression, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Quantization sensitivity measurement is the process of profiling each layer in a neural network to determine how much reconstruction error it incurs under different quantization configurations, producing a per-layer error map that guides optimal bit allocation.
Description
Not all layers in a transformer model are equally sensitive to quantization. Some layers (e.g., early attention projections, certain MLP down-projections) produce large output perturbations when quantized aggressively, while others tolerate aggressive compression with negligible quality loss. Sensitivity measurement answers the question: for each layer, what is the accuracy trade-off at each available quantization setting?
The process works as follows:
- Compute embeddings: Run the tokenized calibration data through the embedding layer to produce initial hidden states.
- Layer-by-layer forward pass: For each subsequent layer, pass the hidden states through the FP16 layer to get reference outputs, then quantize the layer's weights under every candidate configuration (varying bit width, group size, and scale bits) and measure the relative Frobenius norm error between FP16 and quantized outputs.
- Record error profiles: Store each layer's accuracy at each configuration as a list of (total_bits, accuracy) tuples.
The resulting measurement data forms a cost-accuracy frontier for every layer, which is the input to the subsequent bit allocation optimization step.
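The measurement loop described above can be sketched as follows. This is a minimal NumPy illustration, not the actual ExLlamaV2 code: `forward_fp16` and `forward_quantized` are hypothetical stand-ins for the layer's FP16 and quantized forward passes, and the qparams dicts are simplified placeholders.

```python
import numpy as np

def measure_layer_sensitivity(forward_fp16, forward_quantized,
                              hidden_states, candidate_qparams):
    """Profile one layer's accuracy under each candidate quantization config.

    forward_fp16(x)          -> FP16 reference output, shape (rows, dim)
    forward_quantized(x, qp) -> output after quantizing weights with config qp
    Returns a list of (total_bits, accuracy) tuples, one per configuration.
    """
    ref = forward_fp16(hidden_states)
    profile = []
    for qp in candidate_qparams:
        out = forward_quantized(hidden_states, qp)
        # Mean relative Frobenius-norm error over calibration rows
        rel_err = np.linalg.norm(out - ref, axis=1) / np.linalg.norm(ref, axis=1)
        accuracy = 1.0 - rel_err.mean()
        profile.append((qp["total_bits"], float(accuracy)))
    return profile
```

In the real pipeline the quantized forward pass is produced by GPTQ-quantizing the layer's weights in place for each trial; the profile returned here is exactly the per-layer cost-accuracy frontier consumed by the bit allocation step.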
Usage
Sensitivity measurement is the second step in the EXL2 conversion pipeline, executed after calibration tokenization and before bit allocation optimization. It is typically the most time-consuming step, as it requires loading each layer onto the GPU and running multiple quantization trials.
Theoretical Basis
Reconstruction Error Metric
For each layer l with FP16 output Y and quantized output Y_hat, the accuracy is defined as:
accuracy_l = 1 - (1/N) * sum_i( ||Y_hat_i - Y_i||_F / ||Y_i||_F )
where i ranges over calibration rows and ||.||_F is the Frobenius norm. The metric equals 1.0 for perfect reconstruction and falls toward 0 as the relative error grows. A minimum accuracy threshold of 0.1 is enforced; values below it indicate a measurement or inference error and cause the conversion process to abort.
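A tiny numeric check of the metric, with two calibration rows whose relative error is constructed to be exactly 0.1 each:

```python
import numpy as np

# FP16 reference rows, with Frobenius (row) norms 5 and 10
Y = np.array([[3.0, 4.0],
              [6.0, 8.0]])
# Quantized outputs with a known perturbation: row error norms 0.5 and 1.0
Y_hat = Y + np.array([[0.0, 0.5],
                      [1.0, 0.0]])
# Per-row relative error: [0.5/5, 1.0/10] = [0.1, 0.1]
rel = np.linalg.norm(Y_hat - Y, axis=1) / np.linalg.norm(Y, axis=1)
accuracy = 1.0 - rel.mean()  # 1.0 - 0.1 = 0.9
```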
Hessian Computation
For each linear sub-layer, an AdaptiveGPTQ quantizer accumulates the Hessian matrix from calibration inputs:
H += X_batch^T * X_batch
The Hessian captures the second-order structure of the layer's input distribution, allowing GPTQ to make optimal rounding decisions. Sub-layers that share the same input (e.g., Q/K/V projections all receive the post-norm hidden state) can reuse the same Hessian to save computation.
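The accumulation step can be sketched as below. This is a simplified illustration of what an AdaptiveGPTQ-style quantizer does with its calibration inputs; the real implementation also applies damping/normalization to H before solving, which is omitted here.

```python
import numpy as np

def accumulate_hessian(batches, dim):
    """Accumulate H += X^T X over calibration input batches.

    batches: iterable of (rows, dim) input matrices seen by the linear layer.
    Returns the (dim, dim) Hessian approximation of the input distribution.
    """
    H = np.zeros((dim, dim))
    for X in batches:
        H += X.T @ X
    return H

# Because Q/K/V projections all consume the same post-norm hidden state,
# a single accumulated H can be shared by all three quantizers.
```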
Quantization Configurations Tested
Each layer is tested under a reduced set of quantization parameter combinations from qparams_attn and qparams_mlp. These include:
- Bit widths: 2, 3, 4, 5, 6, 8 bits
- Group sizes: 32, 64, 128, 256
- Scale bits: 4, 6
- Mixed-precision proportions: e.g., 65% 4-bit + 35% 3-bit
The get_qparams_reduced function generates only the Pareto-relevant combinations to avoid combinatorial explosion across the sub-layers of each module.
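The idea of keeping only Pareto-relevant combinations can be sketched as below. This is a hypothetical simplification, not the actual `get_qparams_reduced`: it models per-weight cost as weight bits plus amortized scale bits, and drops any combination that another combination dominates (at least as many weight bits at strictly lower total cost).

```python
from itertools import product

def reduced_qparams(bit_widths, group_sizes, scale_bits):
    """Enumerate (bits, group_size, scale_bits) combos and keep only the
    Pareto-relevant ones under a simplified per-weight cost model."""
    combos = []
    for b, g, s in product(bit_widths, group_sizes, scale_bits):
        total = b + s / g  # weight bits + scale bits amortized over the group
        combos.append((b, g, s, total))
    keep = []
    for c in combos:
        # c is dominated if some combo has >= weight bits at strictly lower cost
        dominated = any(o[0] >= c[0] and o[3] < c[3] for o in combos)
        if not dominated:
            keep.append(c)
    return keep
```

Under this model, only the cheapest variant of each bit width survives, which is the kind of pruning that keeps the per-module search over sub-layer combinations tractable.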