Implementation:Turboderp org Exllamav2 Quant Layers

From Leeroopedia
Domains Quantization, Model_Compression, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool that applies GPTQ-based weight quantization to each model layer, following the optimized per-layer strategy computed by exllamav2's optimize step.

Description

The quant function iterates through every module in the model, creates AdaptiveGPTQ quantizer instances for each linear sub-layer, accumulates Hessian information from the calibration hidden states via a forward pass, then quantizes each linear layer using the parameters assigned by the optimization step. Quantized weights are packed into the EXL2 format and saved as individual safetensors files. After quantization, the function performs a verification forward pass to compute the relative Frobenius norm error, and for the final lm_head layer, computes calibration perplexity. Checkpoints are saved every 180 seconds.
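The Hessian accumulation mentioned above can be illustrated with a minimal sketch. This is a generic GPTQ-style accumulation, not exllamav2's actual AdaptiveGPTQ code: for a linear layer, the proxy Hessian of the squared reconstruction error is H = Σ x xᵀ over calibration inputs, built up batch by batch (the function name add_batch mirrors the API named in this page; the body here is illustrative).

```python
# Minimal sketch of GPTQ-style Hessian accumulation (illustrative, not
# exllamav2's actual AdaptiveGPTQ implementation). For a linear layer
# y = W x, the proxy Hessian is H = sum_i x_i x_i^T over calibration
# inputs, accumulated one batch at a time.

def add_batch(H, rows):
    """Accumulate x x^T for each calibration row x into H (in place)."""
    for x in rows:
        for i in range(len(x)):
            for j in range(len(x)):
                H[i][j] += x[i] * x[j]
    return H

d = 3
H = [[0.0] * d for _ in range(d)]
batch1 = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
batch2 = [[2.0, 1.0, 0.0]]
add_batch(H, batch1)
add_batch(H, batch2)
# H is symmetric and positive semi-definite by construction
```

Because H depends only on a layer's inputs, it can be accumulated once per calibration pass and reused by every quantizer that consumes the same activations.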

Usage

Call quant after optimize has populated job["strategy"]. This is the most VRAM-intensive and time-consuming step after measurement.

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/conversion/quantize.py
  • Lines: L256-543 (main quant function)

Internal Sub-Functions

  • quant_linear (L50-132): Quantize a single ExLlamaV2Linear layer: configure AdaptiveGPTQ, quantize, pack to EXL2, verify reconstruction
  • quant_attn (L135-151): Orchestrate Q/K/V/O projection quantization with Hessian reuse (K and V reuse Q's Hessian)
  • quant_mlp (L154-181): Orchestrate gate/up/down projection quantization with Hessian reuse (gate reuses up's Hessian)
  • quant_moe_mlp (L184-200): Quantize all experts' w1/w3/w2 projections for MoE layers
  • quant_lm_head (L203-209): Quantize the language model head (optionally using RTN for very large heads)
  • quant_parallel_decoder (L212-217): Quantize attention + MLP for architectures with parallel decoder blocks
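The RTN fallback that quant_lm_head can use for very large heads can be sketched as a simple round-to-nearest quantizer. The code below is a generic illustration under the usual RTN scheme (symmetric per-row scale), not exllamav2's actual implementation; the function name rtn_quantize is hypothetical.

```python
# Minimal round-to-nearest (RTN) quantization sketch, the kind of scheme
# quant_lm_head can fall back to for very large heads (illustrative only;
# not exllamav2's actual code or its EXL2 packing).

def rtn_quantize(row, bits=4):
    """Symmetric per-row RTN: scale to the int range, round, dequantize."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in row) / qmax or 1.0
    q = [round(w / scale) for w in row]          # integer codes
    deq = [qi * scale for qi in q]               # dequantized weights
    return q, deq, scale

q, deq, scale = rtn_quantize([7.0, -3.0, 0.0])
# q == [7, -3, 0], scale == 1.0
```

Unlike GPTQ, RTN needs no Hessian and no calibration pass, which is why it is attractive for a very large lm_head where Hessian accumulation would be expensive.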

Signature

@torch.inference_mode()
def quant(job, save_fn, model):

Import

from exllamav2.conversion.quantize import quant

I/O Contract

Inputs

  • job (dict, required): Conversion job state. Key fields: job["strategy"] (dict from optimize mapping layer keys to QParams), job["out_dir"] (working directory), job["cal_filename"] (tokenized calibration data), job["head_bits"] (int, bit width for lm_head)
  • save_fn (callable, required): Callback to persist job state (called at checkpoints)
  • model (ExLlamaV2, required): The loaded FP16 model instance; modules are loaded and unloaded one at a time
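The input contract above can be made concrete with a minimal job dict. The field names are the ones documented on this page; the values and paths are placeholders, not real defaults, and the QParams entry is stubbed out.

```python
# Illustrative minimal job dict with the fields quant() reads, per the
# I/O contract above. Paths and values are placeholders, not defaults.
job = {
    "strategy": {
        # from optimize(): layer key -> QParams (stubbed here)
        "model.layers.0.self_attn.q_proj": None,
    },
    "out_dir": "/tmp/exl2-work",          # working directory for outputs
    "cal_filename": "/tmp/exl2-work/cal_data.safetensors",  # tokenized calibration data
    "head_bits": 6,                       # bit width for the lm_head layer
}

def save_fn():
    """Checkpoint callback: persist the job dict (no-op stub here)."""
    pass
```

In a real conversion these fields are filled in by the earlier pipeline stages (tokenization, measurement, optimize) rather than written by hand.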

Outputs

  • Per-module safetensors files: One file per linear layer saved to job["out_dir"]/out_tensor/{module_key}.safetensors, containing packed quantized weights (q_weight, q_scale, q_scale_max, q_groups, q_invperm)
  • job["q_last_module_idx"] (int, side effect): Tracks the last completed module index for checkpoint/resume support
  • hidden_states.safetensors (file, updated): Overwritten with post-quantization hidden states as the pass advances through layers

Quantization Flow per Layer

  1. Load module onto GPU
  2. Create AdaptiveGPTQ quantizers for each linear sub-layer
  3. Forward pass through calibration rows to accumulate Hessian (add_batch)
  4. Quantize each linear sub-layer using quant_linear:
    1. Configure quantizer with assigned group_size, bits, bits_prop, scale_bits
    2. Call lq.quantize(keep_qweight=True, apply=True)
    3. Pack into EXL2 format and save to disk
    4. Reconstruct and verify (unpack check + forward check)
    5. Apply reconstructed weights back to the module for the next layer's forward pass
  5. Verification forward pass: compute relative Frobenius norm error between FP16 and quantized outputs
  6. Advance hidden states to quantized outputs for the next layer
  7. Checkpoint every 180 seconds
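The verification metric in step 5 can be sketched directly from its definition: the Frobenius norm of the difference between FP16 and quantized outputs, divided by the Frobenius norm of the FP16 reference. This is the generic formula; exllamav2's exact reduction over batches may differ.

```python
# Sketch of the step-5 verification metric: relative Frobenius norm error
# between FP16 outputs and quantized-layer outputs (generic formula).
import math

def rfn_error(ref, approx):
    """|| ref - approx ||_F / || ref ||_F over 2-D lists of floats."""
    num = sum((r - a) ** 2
              for row_r, row_a in zip(ref, approx)
              for r, a in zip(row_r, row_a))
    den = sum(r ** 2 for row in ref for r in row)
    return math.sqrt(num) / math.sqrt(den)

fp16_out  = [[3.0, 4.0], [0.0, 0.0]]
quant_out = [[3.0, 4.0], [0.0, 1.0]]
err = rfn_error(fp16_out, quant_out)   # ||diff||_F = 1, ||ref||_F = 5
# err == 0.2
```

A small relative error here indicates the packed layer reproduces the FP16 layer's outputs closely on the calibration data; hidden states are then advanced to the quantized outputs so later layers see realistic inputs.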

Usage Examples

Basic Example

from exllamav2.conversion.quantize import quant

# After optimize() has set job["strategy"]
quant(job, save_fn, model)

# Quantized tensors are now saved in job["out_dir"]/out_tensor/
# Each file is named: {layer_key}.safetensors
# e.g., model.layers.0.self_attn.q_proj.safetensors

Inspecting Quantized Output

from safetensors import safe_open
import os

tensor_dir = os.path.join(job["out_dir"], "out_tensor")
for f in sorted(os.listdir(tensor_dir)):
    filepath = os.path.join(tensor_dir, f)
    with safe_open(filepath, framework="pt") as sf:
        keys = list(sf.keys())
        print(f"{f}: {keys}")

Dependencies

  • torch -- tensor operations, CUDA synchronization, inference mode
  • safetensors -- loading hidden states and saving packed quantized weights
  • AdaptiveGPTQ -- Hessian accumulation, GPTQ quantization, EXL2 packing
  • QParams, qparams_headoptions -- quantization parameter definitions
  • exllamav2_ext (C++ extension) -- softcap operation for models with logit softcapping
  • ExLlamaV2 model types -- ExLlamaV2Linear, ExLlamaV2Attention, ExLlamaV2MLP, ExLlamaV2MoEMLP, ExLlamaV2ParallelDecoder
