Overview
A concrete tool that applies GPTQ-based weight quantization to each model layer according to the optimized per-layer strategy produced by exllamav2's optimize step.
Description
The quant function iterates through every module in the model, creates AdaptiveGPTQ quantizer instances for each linear sub-layer, accumulates Hessian information from the calibration hidden states via a forward pass, then quantizes each linear layer using the parameters assigned by the optimization step. Quantized weights are packed into the EXL2 format and saved as individual safetensors files. After quantization, the function performs a verification forward pass to compute the relative Frobenius norm error, and for the final lm_head layer, computes calibration perplexity. Checkpoints are saved every 180 seconds.
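The description mentions that a calibration perplexity is computed for the final lm_head layer. As an illustrative sketch only (not the exact code in quantize.py), perplexity is typically derived from the model's logits like this:

```python
import torch

def calibration_perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: [num_tokens, vocab_size], targets: [num_tokens]
    # Perplexity is exp of the mean negative log-likelihood of the targets.
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    nll = -logprobs[torch.arange(targets.numel()), targets].mean()
    return torch.exp(nll).item()
```

For example, uniform logits over a vocabulary of size V give a perplexity of exactly V.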
Usage
Call quant after optimize has populated job["strategy"]. This is the most VRAM-intensive and time-consuming step after measurement.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/quantize.py
- Lines: L256-543 (main quant function)
Internal Sub-Functions

| Function | Lines | Description |
| --- | --- | --- |
| quant_linear | L50-132 | Quantize a single ExLlamaV2Linear layer: configure AdaptiveGPTQ, quantize, pack to EXL2, verify reconstruction |
| quant_attn | L135-151 | Orchestrate Q/K/V/O projection quantization with Hessian reuse (K, V reuse Q's Hessian) |
| quant_mlp | L154-181 | Orchestrate gate/up/down projection quantization with Hessian reuse (gate reuses up's Hessian) |
| quant_moe_mlp | L184-200 | Quantize all experts' w1/w3/w2 projections for MoE layers |
| quant_lm_head | L203-209 | Quantize the language model head (optionally using RTN for very large heads) |
| quant_parallel_decoder | L212-217 | Quantize attention + MLP for architectures with parallel decoder blocks |
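The Hessian reuse in quant_attn and quant_mlp works because the GPTQ Hessian depends only on a layer's input activations: the Q, K and V projections all read the same hidden states, so one accumulated Hessian serves all three. A minimal sketch of the idea (illustrative; AdaptiveGPTQ accumulates this incrementally via add_batch rather than in one matmul):

```python
import torch

def input_hessian(x: torch.Tensor) -> torch.Tensor:
    # GPTQ builds H from the layer *input* only: H = 2 * X^T X
    x = x.flatten(0, -2).float()   # [tokens, in_features]
    return 2.0 * x.T @ x

x = torch.randn(8, 16, 64)         # calibration hidden states
h = input_hessian(x)               # accumulated once for q_proj...
# ...and reused for k_proj and v_proj, which consume the same x
```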
Signature

```python
@torch.inference_mode()
def quant(job, save_fn, model):
```

Import

```python
from exllamav2.conversion.quantize import quant
```
I/O Contract
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| job | dict | Yes | Conversion job state. Key fields: job["strategy"] (dict from optimize mapping layer keys to QParams), job["out_dir"] (working directory), job["cal_filename"] (tokenized calibration data), job["head_bits"] (int, bit width for lm_head) |
| save_fn | callable | Yes | Callback to persist job state (called at checkpoints) |
| model | ExLlamaV2 | Yes | The loaded FP16 model instance; modules are loaded and unloaded one at a time |
Outputs
| Name | Type | Description |
| --- | --- | --- |
| Per-module safetensors | Files | One file per linear layer, saved to job["out_dir"]/out_tensor/{module_key}.safetensors and containing packed quantized weights (q_weight, q_scale, q_scale_max, q_groups, q_invperm) |
| job["q_last_module_idx"] | int (side effect) | Tracks the last completed module index for checkpoint/resume support |
| hidden_states.safetensors | File (updated) | Overwritten with post-quantization hidden states as the pass advances through layers |
Quantization Flow per Layer
- Load the module onto the GPU
- Create AdaptiveGPTQ quantizers for each linear sub-layer
- Forward pass through the calibration rows to accumulate the Hessian (add_batch)
- Quantize each linear sub-layer using quant_linear:
  - Configure the quantizer with the assigned group_size, bits, bits_prop, scale_bits
  - Call lq.quantize(keep_qweight=True, apply=True)
  - Pack into EXL2 format and save to disk
  - Reconstruct and verify (unpack check + forward check)
  - Apply the reconstructed weights back to the module for the next layer's forward pass
- Verification forward pass: compute the relative Frobenius norm error between FP16 and quantized outputs
- Advance the hidden states to the quantized outputs for the next layer
- Checkpoint every 180 seconds
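The verification step's error metric can be sketched as follows (illustrative only; the actual check lives inside quant):

```python
import torch

def relative_frobenius_error(ref: torch.Tensor, quant_out: torch.Tensor) -> float:
    # ||ref - quant_out||_F / ||ref||_F, comparing FP16 and quantized outputs
    return (torch.linalg.norm(ref - quant_out) / torch.linalg.norm(ref)).item()

ref = torch.eye(4)
assert relative_frobenius_error(ref, ref) == 0.0
```

A uniformly scaled output (e.g. quant_out = 0.9 * ref) yields a relative error equal to the scale deviation, 0.1.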
Usage Examples
Basic Example
```python
from exllamav2.conversion.quantize import quant

# After optimize() has set job["strategy"]
quant(job, save_fn, model)

# Quantized tensors are now saved in job["out_dir"]/out_tensor/
# Each file is named {layer_key}.safetensors,
# e.g. model.layers.0.self_attn.q_proj.safetensors
```
Inspecting Quantized Output
```python
import os
from safetensors import safe_open

tensor_dir = os.path.join(job["out_dir"], "out_tensor")
for f in sorted(os.listdir(tensor_dir)):
    filepath = os.path.join(tensor_dir, f)
    with safe_open(filepath, framework="pt") as sf:
        keys = list(sf.keys())
        print(f"{f}: {keys}")
```
Dependencies
- torch -- tensor operations, CUDA synchronization, inference mode
- safetensors -- loading hidden states and saving packed quantized weights
- AdaptiveGPTQ -- Hessian accumulation, GPTQ quantization, EXL2 packing
- QParams, qparams_headoptions -- quantization parameter definitions
- exllamav2_ext (C++ extension) -- softcap operation for models with logit softcapping
- ExLlamaV2 model types -- ExLlamaV2Linear, ExLlamaV2Attention, ExLlamaV2MLP, ExLlamaV2MoEMLP, ExLlamaV2ParallelDecoder