
Principle:Turboderp org Exllamav2 Layer Quantization

From Leeroopedia
Knowledge Sources
Domains Quantization, Model_Compression, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Layer quantization via GPTQ compresses neural network weights from high-precision (FP16) to low-precision (2-8 bit) representations. Weights are processed column by column: each column is rounded to a discrete grid using the inverse Hessian, and the resulting rounding error is propagated to the columns that have not yet been processed.

Description

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot, post-training quantization method that achieves near-lossless compression of large language models. Unlike naive round-to-nearest (RTN) quantization, GPTQ accounts for the correlations between weights by using second-order (Hessian) information derived from the calibration data.

The core idea is to process weight columns sequentially: when a column is quantized (rounded to a discrete grid), the resulting error is propagated to all remaining unprocessed columns using a correction factor derived from the inverse Hessian. This ensures that the overall output perturbation is minimized, not just the per-element rounding error.

ExLlamaV2 extends standard GPTQ with an Adaptive GPTQ variant. In this approach, the optimal quantization parameters for each linear layer have already been determined by the bit allocation optimization step. During the actual quantization, the layer is quantized using those parameters, then the quantized weights are packed into the EXL2 format, and a reconstruction sanity check verifies that the packed representation matches the quantized weights.

Usage

Layer quantization is the fourth step in the EXL2 pipeline, applied after the bit allocation strategy has been determined. Each layer is loaded, quantized according to its assigned strategy, and saved to disk before advancing to the next layer.
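This per-layer loop can be sketched as follows. All names here are illustrative, not the actual ExLlamaV2 API: `quantize_layer` is a plain round-to-grid stand-in for the real Adaptive GPTQ step, and the returned dictionary stands in for saving each layer to disk before moving on.

```python
import numpy as np

def quantize_layer(w, bits):
    """Stand-in quantizer: symmetric round-to-grid at the assigned bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    codes = np.round(w / scale).astype(np.int8)   # grid indices
    return codes, scale

def run_pipeline(layers, strategies):
    """Quantize each layer with its pre-assigned strategy, one at a time,
    so only a single layer's weights need to be resident at once."""
    packed = {}
    for name, w in layers.items():
        codes, scale = quantize_layer(w, strategies[name])  # per-layer strategy
        packed[name] = (codes, scale)                       # stand-in for disk write
    return packed

layers = {"layer0": np.random.randn(8, 8).astype(np.float32)}
out = run_pipeline(layers, {"layer0": 4})
```

The layer-at-a-time structure is what keeps peak memory bounded even for very large models.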

Theoretical Basis

GPTQ Algorithm

Given a weight matrix W of shape (d_out, d_in) and the Hessian H = X^T X from calibration data:

1. Compute H_inv = (H + lambda * I)^{-1}    # Cholesky-based inversion with dampening
2. For each column j = 0, 1, ..., d_in-1:
   a. q_j = quantize(W[:, j])               # Round column j to quantization grid
   b. error_j = (W[:, j] - q_j) / H_inv[j, j]  # Normalized error
   c. W[:, j+1:] -= error_j * H_inv[j, j+1:]   # Propagate error to remaining columns
   d. W[:, j] = q_j                         # Commit quantized column

The key insight is that step (c) adjusts the remaining weights to compensate for the rounding error in column j, weighted by how much those weights co-vary with column j (as captured by the Hessian).
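The steps above can be sketched in NumPy. This follows the pseudocode directly; for simplicity it uses one global quantization grid per matrix, whereas real implementations use per-group scales and a Cholesky factorization of the inverse Hessian for speed and numerical stability.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (d_out, d_in) column by column using calibration inputs X."""
    d_in = W.shape[1]
    H = X.T @ X                                       # Hessian from calibration data
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)    # dampening for invertibility
    H_inv = np.linalg.inv(H)

    W = W.copy()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                    # one global grid, for simplicity
    Q = np.zeros_like(W)
    for j in range(d_in):
        q = np.clip(np.round(W[:, j] / scale), -qmax, qmax) * scale  # (a) round column j
        err = (W[:, j] - q) / H_inv[j, j]             # (b) normalized error
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])  # (c) propagate to later columns
        Q[:, j] = q                                   # (d) commit quantized column
    return Q
```

Note that step (c) is an outer product: the per-row error vector is spread across the remaining columns in proportion to their Hessian coupling with column j.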

Adaptive GPTQ Extension

ExLlamaV2's Adaptive GPTQ adds the following capabilities:

  • Mixed-precision quantization: Within a single matrix, different groups of weights can use different bit widths (e.g., 65% of groups at 4-bit, 35% at 3-bit).
  • Configurable group sizes: Quantization scale factors are computed per group of rows (e.g., groups of 32, 64, 128, or 256 rows).
  • Scale quantization: The scale factors themselves can be quantized to 4 or 6 bits for additional compression.
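The combined effect of these three knobs on storage cost can be estimated with a back-of-the-envelope calculation. This illustrates the accounting only, not ExLlamaV2's exact measurement:

```python
def avg_bits_per_weight(split, group_size=128, scale_bits=6):
    """split: list of (fraction_of_groups, weight_bits) pairs.

    Each group of `group_size` weights shares one scale factor stored
    in `scale_bits` bits, so the scale overhead is scale_bits/group_size
    bits per weight.
    """
    weight_bits = sum(frac * bits for frac, bits in split)
    scale_overhead = scale_bits / group_size
    return weight_bits + scale_overhead

# 65% of groups at 4-bit, 35% at 3-bit, group size 128, 6-bit scales:
bpw = avg_bits_per_weight([(0.65, 4), (0.35, 3)])  # ~3.70 bits per weight
```

Smaller groups track the weight distribution more closely but pay a larger scale overhead, which is the trade-off the bit allocation step navigates.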

Reconstruction Verification

After quantization and packing, the implementation performs two sanity checks:

  1. Unpack check: Dequantize the packed weights and compare to the quantized matrix. Maximum allowed difference: 0.05.
  2. Forward check: Pass an identity matrix through the reconstructed linear layer and compare to the quantized weights. Maximum allowed difference: 0.075.

These checks catch packing bugs, CUDA kernel errors, and numerical instabilities.
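The two checks can be illustrated with a toy packing scheme. The thresholds match the text, but the int8 container below is a simplified stand-in for the real EXL2 bit-packing, and in the actual implementation the forward check exercises the inference kernel rather than a plain matrix product:

```python
import numpy as np

def pack_int4(W_q, scale):
    """Toy packing: store grid indices as int8 (EXL2 packs bits tightly)."""
    return np.round(W_q / scale).astype(np.int8)

def unpack_int4(codes, scale):
    return codes.astype(np.float32) * scale

def verify(W_q, codes, scale):
    # 1. Unpack check: dequantize and compare to the quantized matrix.
    W_rec = unpack_int4(codes, scale)
    assert np.abs(W_rec - W_q).max() <= 0.05, "unpack check failed"
    # 2. Forward check: push an identity matrix through the reconstructed
    #    linear layer; the output should reproduce the quantized weights.
    y = np.eye(W_rec.shape[1], dtype=np.float32) @ W_rec.T
    assert np.abs(y - W_q.T).max() <= 0.075, "forward check failed"
    return True
```

In this toy setting both checks are exact; in practice the forward check tolerates slightly more error (0.075 vs 0.05) because it accumulates kernel-level rounding on top of the packing round-trip.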

Related Pages

Implemented By

Uses Heuristic
