Overview
A concrete tool that applies GPTQ-based weight quantization to each model layer according to the optimized per-layer strategy produced by exllamav2's optimize step.
Description
The quant function iterates through every module in the model, creates AdaptiveGPTQ quantizer instances for each linear sub-layer, accumulates Hessian information from the calibration hidden states via a forward pass, then quantizes each linear layer using the parameters assigned by the optimization step. Quantized weights are packed into the EXL2 format and saved as individual safetensors files. After quantization, the function performs a verification forward pass to compute the relative Frobenius norm error, and for the final lm_head layer, computes calibration perplexity. Checkpoints are saved every 180 seconds.
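The description mentions that a calibration perplexity is computed for the final lm_head layer. As an illustrative sketch only (not the exact code in quantize.py), perplexity is typically derived from the model's logits like this:

```python
import torch

def calibration_perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: [num_tokens, vocab_size], targets: [num_tokens]
    # Perplexity is exp of the mean negative log-likelihood of the targets.
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    nll = -logprobs[torch.arange(targets.numel()), targets].mean()
    return torch.exp(nll).item()
```

For example, uniform logits over a vocabulary of size V give a perplexity of exactly V.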
Usage
Call quant after optimize has populated job["strategy"]. This is the most VRAM-intensive and time-consuming step after measurement.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/quantize.py
- Lines: L256-543 (main quant function)
Internal Sub-Functions

| Function | Lines | Description |
| --- | --- | --- |
| quant_linear | L50-132 | Quantize a single ExLlamaV2Linear layer: configure AdaptiveGPTQ, quantize, pack to EXL2, verify reconstruction |
| quant_attn | L135-151 | Orchestrate Q/K/V/O projection quantization with Hessian reuse (K, V reuse Q's Hessian) |
| quant_mlp | L154-181 | Orchestrate gate/up/down projection quantization with Hessian reuse (gate reuses up's Hessian) |
| quant_moe_mlp | L184-200 | Quantize all experts' w1/w3/w2 projections for MoE layers |
| quant_lm_head | L203-209 | Quantize the language model head (optionally using RTN for very large heads) |
| quant_parallel_decoder | L212-217 | Quantize attention + MLP for architectures with parallel decoder blocks |
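The Hessian reuse in quant_attn and quant_mlp works because the GPTQ Hessian depends only on a layer's input activations: the Q, K and V projections all read the same hidden states, so one accumulated Hessian serves all three. A minimal sketch of the idea (illustrative; AdaptiveGPTQ accumulates this incrementally via add_batch rather than in one matmul):

```python
import torch

def input_hessian(x: torch.Tensor) -> torch.Tensor:
    # GPTQ builds H from the layer *input* only: H = 2 * X^T X
    x = x.flatten(0, -2).float()   # [tokens, in_features]
    return 2.0 * x.T @ x

x = torch.randn(8, 16, 64)         # calibration hidden states
h = input_hessian(x)               # accumulated once for q_proj...
# ...and reused for k_proj and v_proj, which consume the same x
```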
Signature

```python
@torch.inference_mode()
def quant(job, save_fn, model):
```

Import

```python
from exllamav2.conversion.quantize import quant
```
I/O Contract
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| job | dict | Yes | Conversion job state. Key fields: job["strategy"] (dict from optimize mapping layer keys to QParams), job["out_dir"] (working directory), job["cal_filename"] (tokenized calibration data), job["head_bits"] (int, bit width for lm_head) |
| save_fn | callable | Yes | Callback to persist job state (called at checkpoints) |
| model | ExLlamaV2 | Yes | The loaded FP16 model instance; modules are loaded and unloaded one at a time |
Outputs
| Name | Type | Description |
| --- | --- | --- |
| Per-module safetensors | Files | One file per linear layer, saved to job["out_dir"]/out_tensor/{module_key}.safetensors and containing packed quantized weights (q_weight, q_scale, q_scale_max, q_groups, q_invperm) |
| job["q_last_module_idx"] | int (side effect) | Tracks the last completed module index for checkpoint/resume support |
| hidden_states.safetensors | File (updated) | Overwritten with post-quantization hidden states as the pass advances through layers |
Quantization Flow per Layer
- Load the module onto the GPU
- Create AdaptiveGPTQ quantizers for each linear sub-layer
- Forward pass through the calibration rows to accumulate the Hessian (add_batch)
- Quantize each linear sub-layer using quant_linear:
  - Configure the quantizer with the assigned group_size, bits, bits_prop, scale_bits
  - Call lq.quantize(keep_qweight=True, apply=True)
  - Pack into EXL2 format and save to disk
  - Reconstruct and verify (unpack check + forward check)
  - Apply the reconstructed weights back to the module for the next layer's forward pass
- Verification forward pass: compute the relative Frobenius norm error between FP16 and quantized outputs
- Advance the hidden states to the quantized outputs for the next layer
- Checkpoint every 180 seconds
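The verification step's error metric can be sketched as follows (illustrative only; the actual check lives inside quant):

```python
import torch

def relative_frobenius_error(ref: torch.Tensor, quant_out: torch.Tensor) -> float:
    # ||ref - quant_out||_F / ||ref||_F, comparing FP16 and quantized outputs
    return (torch.linalg.norm(ref - quant_out) / torch.linalg.norm(ref)).item()

ref = torch.eye(4)
assert relative_frobenius_error(ref, ref) == 0.0
```

A uniformly scaled output (e.g. quant_out = 0.9 * ref) yields a relative error equal to the scale deviation, 0.1.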
Usage Examples
Basic Example
```python
from exllamav2.conversion.quantize import quant

# After optimize() has set job["strategy"]
quant(job, save_fn, model)

# Quantized tensors are now saved in job["out_dir"]/out_tensor/
# Each file is named {layer_key}.safetensors,
# e.g. model.layers.0.self_attn.q_proj.safetensors
```
Inspecting Quantized Output
```python
import os
from safetensors import safe_open

tensor_dir = os.path.join(job["out_dir"], "out_tensor")
for f in sorted(os.listdir(tensor_dir)):
    filepath = os.path.join(tensor_dir, f)
    with safe_open(filepath, framework="pt") as sf:
        keys = list(sf.keys())
        print(f"{f}: {keys}")
```
Dependencies
- torch -- tensor operations, CUDA synchronization, inference mode
- safetensors -- loading hidden states and saving packed quantized weights
- AdaptiveGPTQ -- Hessian accumulation, GPTQ quantization, EXL2 packing
- QParams, qparams_headoptions -- quantization parameter definitions
- exllamav2_ext (C++ extension) -- softcap operation for models with logit softcapping
- ExLlamaV2 model types -- ExLlamaV2Linear, ExLlamaV2Attention, ExLlamaV2MLP, ExLlamaV2MoEMLP, ExLlamaV2ParallelDecoder