Heuristic:Turboderp_org_Exllamav2_Quantization_Conversion_Tips
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Optimization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Practical guidance for EXL2 model quantization: save measurement files (12x speedup on re-quant), use 2-8 bpw target range, check perplexity < 30 after conversion, and note that VRAM needs depend on model width not depth.
Description
EXL2 quantization converts full-precision models into mixed-bitrate quantized formats using an adaptive GPTQ algorithm with simulated annealing for optimal bit allocation. The process has two expensive passes (measurement and quantization) and several non-obvious rules that significantly affect quality and resource usage.
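The bit-allocation idea can be illustrated with a toy simulated-annealing loop. This is a sketch only: the layer names, error values, and penalty weight are invented, and the real optimizer in exllamav2 is considerably more involved. It shows the core mechanic: measured per-layer error (the role of `measurement.json`) drives a stochastic search for a non-uniform bit assignment under an average-bpw budget.

```python
import math
import random

def anneal_bits(err, budget_bpw, steps=20000, t0=1.0, seed=0):
    """Toy simulated annealing over per-layer bit choices.

    err[layer][bits] = measured quantization error for that layer at that
    bitrate (what the measurement pass provides); budget_bpw = allowed
    average bits per weight across layers.
    """
    rng = random.Random(seed)
    layers = list(err)
    options = {l: sorted(err[l]) for l in layers}
    state = {l: options[l][0] for l in layers}  # start at the lowest bitrate

    def cost(s):
        total_err = sum(err[l][s[l]] for l in layers)
        avg_bits = sum(s.values()) / len(layers)
        # Soft penalty for exceeding the bit budget (weight is arbitrary here)
        return total_err + 1e3 * max(0.0, avg_bits - budget_bpw)

    c = cost(state)
    best, best_c = dict(state), c
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-6       # linear cooling schedule
        layer = rng.choice(layers)
        proposal = dict(state)
        proposal[layer] = rng.choice(options[layer])
        nc = cost(proposal)
        # Accept improvements always; accept regressions with Boltzmann probability
        if nc < c or rng.random() < math.exp((c - nc) / t):
            state, c = proposal, nc
            if c < best_c:
                best, best_c = dict(state), c
    return best

# A quantization-sensitive layer ends up with more bits than a robust one:
errors = {"attn": {2: 10.0, 4: 2.0, 8: 0.1}, "mlp": {2: 1.0, 4: 0.5, 8: 0.4}}
print(anneal_bits(errors, budget_bpw=6.0))  # {'attn': 8, 'mlp': 4}
```

Because the error table is model-specific but bitrate-agnostic, the same measurements can feed this search for any target budget, which is exactly why saving `measurement.json` pays off.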
Usage
Apply these tips when running EXL2 model conversion via `convert_exl2.py`. The measurement pass is by far the most expensive operation, so understanding how to reuse it and validate results is critical.
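A typical workflow can be sketched as shell commands. The flag names (`-i`, `-o`, `-cf`, `-b`, `-m`) follow exllamav2's conversion documentation at the time of writing, and all paths are placeholders; verify both against your checkout before running.

```shell
# First quant: the measurement pass runs and writes measurement.json into the
# working directory given by -o.
python convert.py -i /models/llama2-70b -o /tmp/exl2-work \
    -cf /models/llama2-70b-4.0bpw -b 4.0

# Keep the measurement file for later reuse.
cp /tmp/exl2-work/measurement.json ~/measurements/llama2-70b.json

# Later quants of the same model at other bitrates skip the ~12x-slower
# measurement pass by passing the saved file via -m.
python convert.py -i /models/llama2-70b -o /tmp/exl2-work2 \
    -cf /models/llama2-70b-6.0bpw -b 6.0 \
    -m ~/measurements/llama2-70b.json
```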
The Insight (Rule of Thumb)
- Action: Always save the `measurement.json` file from the first conversion pass.
- Value: The measurement pass is approximately 12x slower than the quantization pass.
- Trade-off: A small file to store; in exchange, subsequent quants of the same model at different bitrates skip the measurement pass entirely.
- Action: Target bitrate between 2 and 8 bpw (bits per weight).
- Value: Values outside this range trigger a warning. Head layer quantization is only useful at 6 and 8 bpw (6 bpw is the default, yielding mixed ~6.3 bpw effective).
- Trade-off: Lower bpw = smaller model but more quality loss. Below 2 bpw or above 8 bpw, the optimizer struggles to find valid allocations.
- Action: Check "calibration perplexity (quant)" after conversion.
- Value: Should be well below 30. If >= 30, quantization likely failed. If in the thousands, it failed catastrophically.
- Trade-off: N/A -- this is a quality validation check.
- Action: Size VRAM for conversion by model width (hidden size), not depth (number of layers).
- Value: 70B and 120B models with the same hidden size require the same VRAM. Rule of thumb: 70B needs ~64 GB RAM + 24 GB VRAM; 7B needs ~16 GB RAM + 8 GB VRAM. Mixtral 8x7B needs ~20 GB VRAM due to wide MLP.
- Trade-off: RAM usage scales with depth; VRAM is the bottleneck and depends on per-layer width.
- Action: Set `-rs` (RoPE scale) manually for models that need it (e.g., deepseek-coder uses `-rs 4`).
- Value: This setting is not automatically read from the model config.
- Trade-off: Incorrect RoPE scaling produces silently degraded calibration and output quality.
- Action: Conversion is resumable. If interrupted, rerun with the same output directory.
- Value: Progress is tracked in `job.json` in the working directory.
- Trade-off: N/A -- resuming avoids redoing completed work after an interruption.
Reasoning
The measurement pass quantizes the entire model approximately 12 times with varying parameters to measure per-layer sensitivity to quantization error. This data feeds into the simulated annealing optimizer that allocates bits non-uniformly across layers. Since this sensitivity data is model-specific (not bitrate-specific), it can be reused for any target bitrate.
The perplexity check validates that quantization did not introduce catastrophic error. A perplexity >= 30 on the calibration sample typically indicates numerical issues (NaN Hessians, insufficient calibration data, or incorrect RoPE configuration).
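The threshold check can be sketched as a small helper (a hypothetical function, not part of exllamav2; it relies only on the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood):

```python
import math

def check_calibration_perplexity(mean_nll: float, threshold: float = 30.0):
    """Return (perplexity, passed) for a mean negative log-likelihood per token.

    Hypothetical validation helper: perplexity = exp(mean NLL) over the
    calibration sample.
    """
    ppl = math.exp(mean_nll)
    return ppl, ppl < threshold

# A healthy quant lands well below the threshold; NaN Hessians or a wrong
# RoPE scale push perplexity past 30 or into the thousands.
print(check_calibration_perplexity(2.0))  # (~7.39, True)
print(check_calibration_perplexity(8.0))  # (~2981.0, False)
```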
From `doc/convert.md:88-91`:
> The first pass is slow, since it effectively quantizes the entire model about 12 times over (albeit with a less comprehensive sample of the calibration dataset), so make sure to save the `measurement.json` file.
From `exllamav2/conversion/convert_exl2.py:66-67`:
```python
if args.bits < 2 or args.bits > 8:
    print(f" !! Warning: target bitrate {args.bits} will likely not be attainable")
```
From `exllamav2/conversion/adaptivegptq.py:298-302`:
```python
# The Cholesky inverse will sometimes fail to compute due to accumulated rounding errors when H
# is very large (e.g. 70B MLP down proj) and a lot of calibration data is used (e.g. 100 rows of
# 4096 tokens). This won't always throw an exception and sometimes just results in a NaN tensor.
if torch.any(torch.isnan(hessian_inv)): raise RuntimeError
```
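The guard above exists because a GPTQ-style Hessian has shape `(in_features, in_features)`, so its memory cost grows with layer width and is independent of model depth. A hedged back-of-envelope (the 28672 intermediate size is an assumed example value for a 70B-class MLP down projection):

```python
# Rough size of the fp32 Hessian accumulated for one linear layer.
def hessian_bytes(in_features: int, fp32_bytes: int = 4) -> int:
    return in_features * in_features * fp32_bytes

# Assumed 70B-class MLP down_proj input width of 28672:
print(hessian_bytes(28672) / 2**30)  # 3.0625 GiB for a single layer's Hessian
```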
From `exllamav2/conversion/convert_exl2.py:55-56`:
```python
if args.length > 2048 or args.measurement_length > 2048:
    print(" !! Warning: calibration rows > 2048 tokens may result in excessive VRAM use")
```
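To see why long calibration rows inflate VRAM, a hedged estimate (the row count, token length, and hidden size are illustrative values, and the fp16 activation-cache model is an assumption, not a description of convert_exl2.py internals):

```python
# Hidden states fed into each layer hold rows * tokens * hidden values.
rows, tokens, hidden, fp16_bytes = 100, 4096, 8192, 2  # assumed 70B-class width
cache_bytes = rows * tokens * hidden * fp16_bytes
print(cache_bytes / 2**30)  # 6.25 GiB of layer-input activations
```

Halving the row length to 2048 tokens halves this figure, which is the point of the warning above.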
Related Pages
- Implementation:Turboderp_org_Exllamav2_Measure_Quant
- Implementation:Turboderp_org_Exllamav2_Optimize_Bit_Allocation
- Implementation:Turboderp_org_Exllamav2_Quant_Layers
- Implementation:Turboderp_org_Exllamav2_Compile_Model
- Principle:Turboderp_org_Exllamav2_Quantization_Sensitivity_Measurement
- Principle:Turboderp_org_Exllamav2_Bit_Allocation_Optimization
- Principle:Turboderp_org_Exllamav2_Layer_Quantization