Implementation:Turboderp_org_Exllamav2_Measure_Quant
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Compression, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool provided by exllamav2 for measuring per-layer quantization sensitivity.
Description
The measurement module provides two main entry points. First, embeddings() computes the initial hidden states by running calibration token IDs through the model's embedding layer. Second, measure_quant() iterates layer-by-layer through the entire model, quantizing each layer under every candidate configuration and recording the reconstruction accuracy. The result is a measurement dictionary mapping each layer key to a list of {accuracy, total_bits, *_proj: qparams} records.
The function supports checkpoint/resume: every 180 seconds it saves intermediate hidden states and the current measurement progress, allowing the process to be interrupted and resumed without losing work. A SIGINT handler provides graceful exit support.
Usage
Call embeddings() after tokenization to generate initial hidden states, then call measure_quant() to profile all layers. The resulting job["measurement"] is consumed by the optimization step.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/measure.py
- Lines: L71-89 (embeddings), L409-736 (measure_quant), L94-123 (test_quant), L126-141 (test_error), L144-203 (measure_attn), L206-292 (measure_mlp), L295-365 (measure_moe_mlp)
Signature
def embeddings(job, save_fn, model, measure=False):
"""Compute initial token embeddings from calibration data."""
@torch.inference_mode()
def measure_quant(job, save_fn, model, hidden_state_offload_layers):
"""Measure quantization error for every layer under all candidate configurations."""
Key Internal Functions
def test_quant(source, lq, qparams):
"""Quantize a single linear layer under each QParams config, return variants and bit counts."""
def test_error(module, hidden_states, target_states, cache, attn_params):
"""Compute mean relative Frobenius norm accuracy across calibration rows."""
def measure_attn(module, hidden_states, target_states, quantizers, cache, attn_params, keep_q=False):
"""Measure all QKV+O projection combinations for an attention module."""
def measure_mlp(module, hidden_states, target_states, quantizers, cache, attn_params, reuse_h_up_proj=None):
"""Measure gate/up/down projection combinations for an MLP module."""
def measure_moe_mlp(module, hidden_states, target_states, quantizers, cache, attn_mask):
"""Measure w1/w3/w2 combinations across all experts for a MoE-MLP module."""
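The accuracy metric described for test_error (one minus the mean relative Frobenius norm error) can be sketched in plain Python. The function name and list-of-lists representation here are illustrative, not the library's API, which operates on GPU tensors:

```python
import math

def relative_frobenius_accuracy(approx, target):
    """Accuracy for one calibration row: 1 - ||approx - target||_F / ||target||_F.

    approx/target are 2D matrices given as lists of rows; test_error averages
    this quantity over all calibration rows.
    """
    # Frobenius norm of the reconstruction error
    err = math.sqrt(sum((a - t) ** 2
                        for ra, rt in zip(approx, target)
                        for a, t in zip(ra, rt)))
    # Frobenius norm of the reference hidden states
    ref = math.sqrt(sum(t ** 2 for rt in target for t in rt))
    return 1.0 - err / ref
```

A perfect reconstruction yields an accuracy of exactly 1.0, and the score falls toward 0.0 as the quantized layer's output diverges from the FP16 reference.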
Import
from exllamav2.conversion.measure import embeddings, measure_quant
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| job | dict | Yes | Conversion job state. Key fields: cal_filename (path to tokenized calibration safetensors), out_dir (working directory), output_measurement (optional path for measurement JSON export) |
| save_fn | callable | Yes | Callback to persist job state to disk (called at each checkpoint) |
| model | ExLlamaV2 | Yes | The loaded FP16 model instance. Modules are loaded/unloaded one at a time during measurement |
| hidden_state_offload_layers | int | Yes (measure_quant only) | Number of hidden state rows to keep on GPU; remaining rows are offloaded to CPU to manage VRAM |
Outputs
| Name | Type | Description |
|---|---|---|
| hidden_states.safetensors | File | Saved to job["out_dir"]/hidden_states.safetensors. Contains per-row embedding tensors keyed as row.00000, row.00001, etc. |
| job["measurement"] | dict (side effect) | Maps layer keys (e.g., model.layers.0.self_attn) to lists of measurement records. Each record contains accuracy (float), total_bits (int), and per-projection QParams dicts |
| measurement.json | File | Exported JSON file with the full measurement dictionary and last_module_idx for resume support |
| Return value | str | "completed" on success, "interrupted" if the user requested a graceful exit |
Measurement Record Format
Each entry in the measurement list for an attention layer has this structure:
{
"accuracy": 0.99876543, # 1 - mean relative Frobenius norm error
"total_bits": 15482880, # Total bits for all projections combined
"q_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
"k_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
"v_proj": {"group_size": 128, "bits": [3], "bits_prop": [1.0], "scale_bits": 4},
"o_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
}
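A record list of this shape lends itself to budget-constrained selection, which is what the downstream optimization step does globally. The helper below is a hypothetical sketch of a simpler per-layer version, not exllamav2 code: it filters candidates by a bit budget and takes the most accurate one.

```python
def best_record(records, bit_budget):
    """Pick the highest-accuracy measurement record that fits a bit budget.

    records: list of dicts with "accuracy" and "total_bits" keys, as produced
    per layer by measure_quant. Returns None if no candidate fits.
    """
    feasible = [r for r in records if r["total_bits"] <= bit_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r["accuracy"])
```

The real optimizer trades bits across layers jointly rather than per layer, but the record fields it consumes are the same.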
Checkpoint and Resume
The measurement process saves checkpoints every 180 seconds (configurable via snapshot_interval_s):
- The current hidden states are written to a temporary file, then atomically renamed to hidden_states.safetensors.
- The job dict is updated with job["measurement"] (accumulated so far) and job["last_module_idx"].
- An invalid flag is used to detect incomplete writes: if the flag is present when resuming, the checkpoint is considered corrupt.
On resume, the function detects job["last_module_idx"] and skips already-measured layers.
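The write-to-temp-then-rename pattern described above can be sketched as follows. atomic_save is a hypothetical helper illustrating the technique, not the module's actual checkpoint code (which writes safetensors, not JSON):

```python
import json
import os
import tempfile

def atomic_save(path, payload):
    """Persist payload as JSON so readers never see a partial file.

    Writes to a temp file in the same directory, fsyncs, then renames.
    os.replace is atomic on POSIX and Windows, so a crash mid-write leaves
    the previous checkpoint intact instead of a corrupt one.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir = directory, suffix = ".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

Placing the temp file in the destination directory matters: os.replace is only atomic when source and target are on the same filesystem.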
Usage Examples
Basic Example
from exllamav2.conversion.measure import embeddings, measure_quant
# After tokenization, compute initial embeddings
embeddings(job, save_fn, model, measure=True)
# Measure all layers (keep first 8 rows on GPU)
result = measure_quant(job, save_fn, model, hidden_state_offload_layers=8)
if result == "completed":
print("Measurement finished successfully")
print(f"Measured {len(job['measurement'])} layers")
Dependencies
- torch -- tensor operations, CUDA management, inference mode
- safetensors -- loading/saving hidden state tensors
- AdaptiveGPTQ -- Hessian accumulation and trial quantization
- QParams, qparams_attn, qparams_mlp, get_qparams_reduced -- quantization parameter definitions and Pareto reduction
- ExLlamaV2 model types -- ExLlamaV2Attention, ExLlamaV2MLP, ExLlamaV2MoEMLP, ExLlamaV2ParallelDecoder, ExLlamaV2Embedding, etc.
Related Pages
Implements Principle
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Build_Toolchain