
Implementation:Turboderp org Exllamav2 Measure Quant

From Leeroopedia
Knowledge Sources
Domains Quantization, Model_Compression, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool, provided by exllamav2, for measuring per-layer quantization sensitivity.

Description

The measurement module provides two main entry points. First, embeddings() computes the initial hidden states by running calibration token IDs through the model's embedding layer. Second, measure_quant() iterates layer-by-layer through the entire model, quantizing each layer under every candidate configuration and recording the reconstruction accuracy. The result is a measurement dictionary mapping each layer key to a list of {accuracy, total_bits, *_proj: qparams} records.

The function supports checkpoint/resume: every 180 seconds it saves intermediate hidden states and the current measurement progress, allowing the process to be interrupted and resumed without losing work. A SIGINT handler provides graceful exit support.

Usage

Call embeddings() after tokenization to generate initial hidden states, then call measure_quant() to profile all layers. The resulting job["measurement"] is consumed by the optimization step.

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/conversion/measure.py
  • Lines: L71-89 (embeddings), L409-736 (measure_quant), L94-123 (test_quant), L126-141 (test_error), L144-203 (measure_attn), L206-292 (measure_mlp), L295-365 (measure_moe_mlp)

Signature

def embeddings(job, save_fn, model, measure=False):
    """Compute initial token embeddings from calibration data."""

@torch.inference_mode()
def measure_quant(job, save_fn, model, hidden_state_offload_layers):
    """Measure quantization error for every layer under all candidate configurations."""

Key Internal Functions

def test_quant(source, lq, qparams):
    """Quantize a single linear layer under each QParams config, return variants and bit counts."""

def test_error(module, hidden_states, target_states, cache, attn_params):
    """Compute mean relative Frobenius norm accuracy across calibration rows."""

def measure_attn(module, hidden_states, target_states, quantizers, cache, attn_params, keep_q=False):
    """Measure all QKV+O projection combinations for an attention module."""

def measure_mlp(module, hidden_states, target_states, quantizers, cache, attn_params, reuse_h_up_proj=None):
    """Measure gate/up/down projection combinations for an MLP module."""

def measure_moe_mlp(module, hidden_states, target_states, quantizers, cache, attn_mask):
    """Measure w1/w3/w2 combinations across all experts for a MoE-MLP module."""
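The accuracy metric reported by test_error can be illustrated with a small sketch. This is not the library's actual implementation (which runs each module forward over calibration rows on torch tensors); it only shows the formula accuracy = 1 - ||output - target||_F / ||target||_F on flat Python lists:

```python
import math

def relative_accuracy(output, target):
    # accuracy = 1 - ||output - target||_F / ||target||_F
    # (Frobenius norm of the error, relative to the norm of the target)
    num = math.sqrt(sum((o - t) ** 2 for o, t in zip(output, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    return 1.0 - num / den

# A perfect reconstruction scores 1.0; a zeroed-out output scores 0.0.
print(relative_accuracy([1.0, 2.0], [1.0, 2.0]))  # 1.0
```

In measure_quant this per-row accuracy is averaged across all calibration rows before being stored in the measurement record.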

Import

from exllamav2.conversion.measure import embeddings, measure_quant

I/O Contract

Inputs

  • job (dict, required) -- Conversion job state. Key fields: cal_filename (path to tokenized calibration safetensors), out_dir (working directory), output_measurement (optional path for measurement JSON export)
  • save_fn (callable, required) -- Callback to persist job state to disk; called at each checkpoint
  • model (ExLlamaV2, required) -- The loaded FP16 model instance. Modules are loaded and unloaded one at a time during measurement
  • hidden_state_offload_layers (int, required; measure_quant only) -- Number of hidden-state rows to keep on the GPU; remaining rows are offloaded to CPU to manage VRAM
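A minimal job dict satisfying this contract might look like the following. The field names follow the input contract above; the paths are made up for the example, and save_fn is only a placeholder for the converter's real persistence callback:

```python
import json

# Illustrative job state; field names from the input contract, paths invented.
job = {
    "cal_filename": "/work/calibration.safetensors",   # tokenized calibration data
    "out_dir": "/work/convert",                        # working dir for checkpoints
    "output_measurement": "/work/measurement.json",    # optional JSON export path
}

def save_fn():
    # Placeholder for the converter's persistence callback, invoked at each
    # checkpoint; here it merely serializes the job dict to a JSON string.
    return json.dumps(job)
```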

Outputs

  • hidden_states.safetensors (file) -- Saved to job["out_dir"]/hidden_states.safetensors; contains per-row embedding tensors keyed as row.00000, row.00001, etc.
  • job["measurement"] (dict, side effect) -- Maps layer keys (e.g. model.layers.0.self_attn) to lists of measurement records. Each record contains accuracy (float), total_bits (int), and per-projection QParams dicts
  • measurement.json (file) -- Exported JSON file with the full measurement dictionary and last_module_idx for resume support
  • Return value (str) -- "completed" on success, "interrupted" if the user requested a graceful exit

Measurement Record Format

Each entry in the measurement list for an attention layer has this structure:

{
    "accuracy": 0.99876543,       # 1 - mean relative Frobenius norm error
    "total_bits": 15482880,       # Total bits for all projections combined
    "q_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
    "k_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
    "v_proj": {"group_size": 128, "bits": [3], "bits_prop": [1.0], "scale_bits": 4},
    "o_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
}
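Given records of this shape, downstream code can for instance pick the cheapest configuration that still meets an accuracy floor. A hedged sketch (the helper name and threshold are illustrative; exllamav2's own optimization step instead solves a global bit-budget problem across all layers):

```python
def cheapest_above(records, min_accuracy=0.998):
    # Keep records meeting the accuracy floor, then take the lowest bit count.
    ok = [r for r in records if r["accuracy"] >= min_accuracy]
    return min(ok, key=lambda r: r["total_bits"]) if ok else None

# Toy records in the format shown above (projection qparams omitted).
records = [
    {"accuracy": 0.99876543, "total_bits": 15482880},
    {"accuracy": 0.99910000, "total_bits": 18579456},
    {"accuracy": 0.99700000, "total_bits": 12386304},
]
print(cheapest_above(records))  # the 15482880-bit record
```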

Checkpoint and Resume

The measurement process saves checkpoints every 180 seconds (configurable via snapshot_interval_s):

  1. The current hidden states are written to a temporary file, then atomically renamed to hidden_states.safetensors.
  2. The job dict is updated with job["measurement"] (accumulated so far) and job["last_module_idx"].
  3. An invalid flag is used to detect incomplete writes: if the flag is present when resuming, the checkpoint is considered corrupt.

On resume, the function detects job["last_module_idx"] and skips already-measured layers.
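The write-then-rename pattern in step 1 can be sketched as follows. The flag filename and helper names here are illustrative assumptions, not the converter's actual identifiers; only hidden_states.safetensors comes from the contract above:

```python
import os
import tempfile

def atomic_checkpoint(data: bytes, out_dir: str, name: str = "hidden_states.safetensors"):
    # Mark the checkpoint as in-progress; if this flag survives a crash,
    # the checkpoint on disk must be treated as corrupt on resume.
    flag = os.path.join(out_dir, "job.invalid")  # illustrative flag filename
    open(flag, "w").close()

    # Write to a temp file in the same directory, then atomically rename
    # over the target so readers never observe a partial file.
    fd, tmp = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, os.path.join(out_dir, name))

    # Clear the flag only after the checkpoint is fully on disk.
    os.remove(flag)

def checkpoint_is_valid(out_dir: str) -> bool:
    return not os.path.exists(os.path.join(out_dir, "job.invalid"))
```

The rename is atomic because the temp file is created in the same directory (hence the same filesystem) as the target.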

Usage Examples

Basic Example

from exllamav2.conversion.measure import embeddings, measure_quant

# After tokenization, compute initial embeddings
embeddings(job, save_fn, model, measure=True)

# Measure all layers (keep first 8 rows on GPU)
result = measure_quant(job, save_fn, model, hidden_state_offload_layers=8)

if result == "completed":
    print("Measurement finished successfully")
    print(f"Measured {len(job['measurement'])} layers")

Dependencies

  • torch -- tensor operations, CUDA management, inference mode
  • safetensors -- loading/saving hidden state tensors
  • AdaptiveGPTQ -- Hessian accumulation and trial quantization
  • QParams, qparams_attn, qparams_mlp, get_qparams_reduced -- quantization parameter definitions and Pareto reduction
  • ExLlamaV2 model types -- ExLlamaV2Attention, ExLlamaV2MLP, ExLlamaV2MoEMLP, ExLlamaV2ParallelDecoder, ExLlamaV2Embedding, etc.
