
Heuristic:Turboderp org Exllamav2 Quantization Conversion Tips

From Leeroopedia
Knowledge Sources
Domains Quantization, Model_Optimization
Last Updated 2026-02-15 00:00 GMT

Overview

Practical guidance for EXL2 model quantization: save the measurement file (~12x speedup on re-quantization), target 2-8 bpw, check that calibration perplexity is well below 30 after conversion, and note that VRAM needs depend on model width (hidden size), not depth (layer count).

Description

EXL2 quantization converts full-precision models into mixed-bitrate quantized formats using an adaptive GPTQ algorithm with simulated annealing for optimal bit allocation. The process has two expensive passes (measurement and quantization) and several non-obvious rules that significantly affect quality and resource usage.

Usage

Apply these tips when running EXL2 model conversion via `convert_exl2.py`. The measurement pass is by far the most expensive operation, so understanding how to reuse it and validate results is critical.
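As a sketch of the reuse workflow (the flag names `-i`, `-o`, `-b`, and `-m` follow the exllamav2 README; treat the exact argument spelling as an assumption and verify against your checkout), a first conversion produces `measurement.json`, and later conversions at other bitrates pass it back in to skip the measurement pass:

```python
# Hedged sketch: builds convert_exl2.py command lines for a first quant and a
# re-quant that reuses measurement.json. Flag names are assumptions based on
# the exllamav2 README; verify against your checkout before running.

def build_convert_cmd(model_dir, work_dir, bpw, measurement=None):
    cmd = ["python", "convert_exl2.py",
           "-i", str(model_dir),   # input: full-precision model directory
           "-o", str(work_dir),    # working directory (holds job.json for resume)
           "-b", str(bpw)]         # target bits per weight (keep within 2-8)
    if measurement is not None:
        # Reusing measurement.json skips the ~12x-slower measurement pass.
        cmd += ["-m", str(measurement)]
    return cmd

first = build_convert_cmd("llama-70b", "work-4.0bpw", 4.0)
requant = build_convert_cmd("llama-70b", "work-2.5bpw", 2.5,
                            measurement="work-4.0bpw/measurement.json")
```

Because the sensitivity data is model-specific rather than bitrate-specific, the same `measurement.json` serves any target bpw for that model.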

The Insight (Rule of Thumb)

  • Action: Always save the `measurement.json` file from the first conversion pass.
  • Value: The measurement pass is approximately 12x slower than the quantization pass.
  • Trade-off: Storing measurement.json allows skipping the measurement pass entirely on subsequent quants of the same model at different bitrates.
  • Action: Target bitrate between 2 and 8 bpw (bits per weight).
  • Value: Values outside this range trigger a warning. Head layer quantization is only useful at 6 and 8 bpw (6 bpw is the default, yielding mixed ~6.3 bpw effective).
  • Trade-off: Lower bpw = smaller model but more quality loss. Below 2 bpw or above 8 bpw, the optimizer struggles to find valid allocations.
  • Action: Check "calibration perplexity (quant)" after conversion.
  • Value: Should be well below 30. If >= 30, quantization likely failed. If in the thousands, it failed catastrophically.
  • Trade-off: N/A -- this is a quality validation check.
  • Action: VRAM requirements are determined by model width (hidden size), not depth (number of layers).
  • Value: 70B and 120B models with the same hidden size require the same VRAM. Rule of thumb: 70B needs ~64 GB RAM + 24 GB VRAM; 7B needs ~16 GB RAM + 8 GB VRAM. Mixtral 8x7B needs ~20 GB VRAM due to wide MLP.
  • Trade-off: RAM usage scales with depth; VRAM is the bottleneck and depends on per-layer width.
  • Action: Set `-rs` (RoPE scale) manually for models that need it (e.g., deepseek-coder uses `-rs 4`).
  • Value: This setting is not automatically read from the model config.
  • Trade-off: Incorrect RoPE scaling produces silently degraded calibration and output quality.
  • Action: Conversion is resumable. If interrupted, rerun with the same output directory.
  • Value: Progress is tracked in `job.json` in the working directory.
  • Trade-off: N/A -- interrupted work is not lost.

Reasoning

The measurement pass quantizes the entire model approximately 12 times with varying parameters to measure per-layer sensitivity to quantization error. This data feeds into the simulated annealing optimizer that allocates bits non-uniformly across layers. Since this sensitivity data is model-specific (not bitrate-specific), it can be reused for any target bitrate.

The perplexity check validates that quantization did not introduce catastrophic error. A perplexity >= 30 on the calibration sample typically indicates numerical issues (NaN Hessians, insufficient calibration data, or incorrect RoPE configuration).
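The check is simple to automate: perplexity is the exponential of the mean per-token negative log-likelihood, and on the calibration sample a sane quant lands well under 30. The helper below is a sketch using this page's threshold, not an exllamav2 API:

```python
import math

def check_calibration_ppl(token_nlls, threshold=30.0):
    """Return (perplexity, ok) given per-token negative log-likelihoods
    (natural log) from the calibration sample. Threshold per this page's
    heuristic: >= 30 suggests failed quantization; values in the thousands
    point to NaN Hessians, insufficient calibration data, or a wrong RoPE
    scale."""
    ppl = math.exp(sum(token_nlls) / len(token_nlls))
    return ppl, ppl < threshold

# A healthy quant: mean NLL around 2.0 gives perplexity around 7.4.
ppl, ok = check_calibration_ppl([2.1, 1.9, 2.0, 2.0])
```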

From `doc/convert.md:88-91`:

The first pass is slow, since it effectively quantizes the entire model about 12 times over (albeit with a less comprehensive sample of the calibration dataset), so make sure to save the `measurement.json` file.

From `exllamav2/conversion/convert_exl2.py:66-67`:

if args.bits < 2 or args.bits > 8:
    print(f" !! Warning: target bitrate {args.bits} will likely not be attainable")

From `exllamav2/conversion/adaptivegptq.py:298-302`:

# The Cholesky inverse will sometimes fail to compute due to accumulated rounding errors when H
# is very large (e.g. 70B MLP down proj) and a lot of calibration data is used (e.g. 100 rows of
# 4096 tokens). This won't always throw an exception and sometimes just results in a NaN tensor.
if torch.any(torch.isnan(hessian_inv)): raise RuntimeError

From `exllamav2/conversion/convert_exl2.py:55-56`:

if args.length > 2048 or args.measurement_length > 2048:
    print(" !! Warning: calibration rows > 2048 tokens may result in excessive VRAM use")
