
Implementation:Turboderp org Exllamav2 Measure Quant

From Leeroopedia
Knowledge Sources
Domains Quantization, Model_Compression, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool, provided by exllamav2, for measuring per-layer quantization sensitivity.

Description

The measurement module provides two main entry points. First, embeddings() computes the initial hidden states by running calibration token IDs through the model's embedding layer. Second, measure_quant() iterates layer-by-layer through the entire model, quantizing each layer under every candidate configuration and recording the reconstruction accuracy. The result is a measurement dictionary mapping each layer key to a list of {accuracy, total_bits, *_proj: qparams} records.

The function supports checkpoint/resume: every 180 seconds it saves intermediate hidden states and the current measurement progress, allowing the process to be interrupted and resumed without losing work. A SIGINT handler provides graceful exit support.

Usage

Call embeddings() after tokenization to generate initial hidden states, then call measure_quant() to profile all layers. The resulting job["measurement"] is consumed by the optimization step.

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/conversion/measure.py
  • Lines: L71-89 (embeddings), L409-736 (measure_quant), L94-123 (test_quant), L126-141 (test_error), L144-203 (measure_attn), L206-292 (measure_mlp), L295-365 (measure_moe_mlp)

Signature

def embeddings(job, save_fn, model, measure=False):
    """Compute initial token embeddings from calibration data."""

@torch.inference_mode()
def measure_quant(job, save_fn, model, hidden_state_offload_layers):
    """Measure quantization error for every layer under all candidate configurations."""

Key Internal Functions

def test_quant(source, lq, qparams):
    """Quantize a single linear layer under each QParams config, return variants and bit counts."""

def test_error(module, hidden_states, target_states, cache, attn_params):
    """Compute mean relative Frobenius norm accuracy across calibration rows."""

def measure_attn(module, hidden_states, target_states, quantizers, cache, attn_params, keep_q=False):
    """Measure all QKV+O projection combinations for an attention module."""

def measure_mlp(module, hidden_states, target_states, quantizers, cache, attn_params, reuse_h_up_proj=None):
    """Measure gate/up/down projection combinations for an MLP module."""

def measure_moe_mlp(module, hidden_states, target_states, quantizers, cache, attn_mask):
    """Measure w1/w3/w2 combinations across all experts for a MoE-MLP module."""
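The accuracy metric reported by test_error can be illustrated with a small sketch. This is not the library's actual implementation (which runs each module forward over calibration rows on torch tensors); it only shows the formula accuracy = 1 - ||output - target||_F / ||target||_F on flat Python lists:

```python
import math

def relative_accuracy(output, target):
    # accuracy = 1 - ||output - target||_F / ||target||_F
    # (Frobenius norm of the error, relative to the norm of the target)
    num = math.sqrt(sum((o - t) ** 2 for o, t in zip(output, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    return 1.0 - num / den

# A perfect reconstruction scores 1.0; a zeroed-out output scores 0.0.
print(relative_accuracy([1.0, 2.0], [1.0, 2.0]))  # 1.0
```

In measure_quant this per-row accuracy is averaged across all calibration rows before being stored in the measurement record.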

Import

from exllamav2.conversion.measure import embeddings, measure_quant

I/O Contract

Inputs

  • job (dict, required) -- Conversion job state. Key fields: cal_filename (path to tokenized calibration safetensors), out_dir (working directory), output_measurement (optional path for measurement JSON export)
  • save_fn (callable, required) -- Callback to persist job state to disk; called at each checkpoint
  • model (ExLlamaV2, required) -- The loaded FP16 model instance. Modules are loaded and unloaded one at a time during measurement
  • hidden_state_offload_layers (int, required; measure_quant only) -- Number of hidden-state rows to keep on the GPU; remaining rows are offloaded to CPU to manage VRAM
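A minimal job dict satisfying this contract might look like the following. The field names follow the input contract above; the paths are made up for the example, and save_fn is only a placeholder for the converter's real persistence callback:

```python
import json

# Illustrative job state; field names from the input contract, paths invented.
job = {
    "cal_filename": "/work/calibration.safetensors",   # tokenized calibration data
    "out_dir": "/work/convert",                        # working dir for checkpoints
    "output_measurement": "/work/measurement.json",    # optional JSON export path
}

def save_fn():
    # Placeholder for the converter's persistence callback, invoked at each
    # checkpoint; here it merely serializes the job dict to a JSON string.
    return json.dumps(job)
```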

Outputs

  • hidden_states.safetensors (file) -- Saved to job["out_dir"]/hidden_states.safetensors; contains per-row embedding tensors keyed as row.00000, row.00001, etc.
  • job["measurement"] (dict, side effect) -- Maps layer keys (e.g. model.layers.0.self_attn) to lists of measurement records. Each record contains accuracy (float), total_bits (int), and per-projection QParams dicts
  • measurement.json (file) -- Exported JSON file with the full measurement dictionary and last_module_idx for resume support
  • Return value (str) -- "completed" on success, "interrupted" if the user requested a graceful exit

Measurement Record Format

Each entry in the measurement list for an attention layer has this structure:

{
    "accuracy": 0.99876543,       # 1 - mean relative Frobenius norm error
    "total_bits": 15482880,       # Total bits for all projections combined
    "q_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
    "k_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
    "v_proj": {"group_size": 128, "bits": [3], "bits_prop": [1.0], "scale_bits": 4},
    "o_proj": {"group_size": 128, "bits": [4], "bits_prop": [1.0], "scale_bits": 4},
}
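Given records of this shape, downstream code can for instance pick the cheapest configuration that still meets an accuracy floor. A hedged sketch (the helper name and threshold are illustrative; exllamav2's own optimization step instead solves a global bit-budget problem across all layers):

```python
def cheapest_above(records, min_accuracy=0.998):
    # Keep records meeting the accuracy floor, then take the lowest bit count.
    ok = [r for r in records if r["accuracy"] >= min_accuracy]
    return min(ok, key=lambda r: r["total_bits"]) if ok else None

# Toy records in the format shown above (projection qparams omitted).
records = [
    {"accuracy": 0.99876543, "total_bits": 15482880},
    {"accuracy": 0.99910000, "total_bits": 18579456},
    {"accuracy": 0.99700000, "total_bits": 12386304},
]
print(cheapest_above(records))  # the 15482880-bit record
```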

Checkpoint and Resume

The measurement process saves checkpoints every 180 seconds (configurable via snapshot_interval_s):

  1. The current hidden states are written to a temporary file, then atomically renamed to hidden_states.safetensors.
  2. The job dict is updated with job["measurement"] (accumulated so far) and job["last_module_idx"].
  3. An invalid flag is used to detect incomplete writes: if the flag is present when resuming, the checkpoint is considered corrupt.

On resume, the function detects job["last_module_idx"] and skips already-measured layers.
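The write-then-rename pattern in step 1 can be sketched as follows. The flag filename and helper names here are illustrative assumptions, not the converter's actual identifiers; only hidden_states.safetensors comes from the contract above:

```python
import os
import tempfile

def atomic_checkpoint(data: bytes, out_dir: str, name: str = "hidden_states.safetensors"):
    # Mark the checkpoint as in-progress; if this flag survives a crash,
    # the checkpoint on disk must be treated as corrupt on resume.
    flag = os.path.join(out_dir, "job.invalid")  # illustrative flag filename
    open(flag, "w").close()

    # Write to a temp file in the same directory, then atomically rename
    # over the target so readers never observe a partial file.
    fd, tmp = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, os.path.join(out_dir, name))

    # Clear the flag only after the checkpoint is fully on disk.
    os.remove(flag)

def checkpoint_is_valid(out_dir: str) -> bool:
    return not os.path.exists(os.path.join(out_dir, "job.invalid"))
```

The rename is atomic because the temp file is created in the same directory (hence the same filesystem) as the target.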

Usage Examples

Basic Example

from exllamav2.conversion.measure import embeddings, measure_quant

# After tokenization, compute initial embeddings
embeddings(job, save_fn, model, measure=True)

# Measure all layers (keep first 8 rows on GPU)
result = measure_quant(job, save_fn, model, hidden_state_offload_layers=8)

if result == "completed":
    print("Measurement finished successfully")
    print(f"Measured {len(job['measurement'])} layers")

Dependencies

  • torch -- tensor operations, CUDA management, inference mode
  • safetensors -- loading/saving hidden state tensors
  • AdaptiveGPTQ -- Hessian accumulation and trial quantization
  • QParams, qparams_attn, qparams_mlp, get_qparams_reduced -- quantization parameter definitions and Pareto reduction
  • ExLlamaV2 model types -- ExLlamaV2Attention, ExLlamaV2MLP, ExLlamaV2MoEMLP, ExLlamaV2ParallelDecoder, ExLlamaV2Embedding, etc.
