Implementation:Turboderp org Exllamav2 Optimize Bit Allocation
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization, Model_Compression |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by exllamav2, for finding a low-error per-layer assignment of quantization parameters under a global bitrate budget.
Description
The optimize function takes the per-layer measurement data (accuracy vs. total_bits for each candidate configuration) and a target bits-per-weight, then uses a C++ simulated annealing implementation (exllamav2_ext.sim_anneal) to find the assignment of quantization parameters to each layer that minimizes overall error without exceeding the bit budget. The optimizer runs in three stages: a broad norm sweep, a refined norm sweep, and a final exploitation pass. After the SA solver converges, a greedy pass allocates any remaining bit budget by upgrading layers one step at a time.
Usage
Call optimize after measure_quant has populated job["measurement"]. The result is stored in job["strategy"], which is consumed by the quant function in the next pipeline step.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/optimize.py
- Lines: L8-189
Signature
def optimize(job, save_fn, model):
Import
from exllamav2.conversion.optimize import optimize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| job | dict | Yes | Conversion job state. Key fields: job["measurement"] (dict from measure_quant with per-layer error profiles), job["bits"] (float, target average bits-per-weight, e.g., 4.125) |
| save_fn | callable | Yes | Callback to persist job state to disk |
| model | ExLlamaV2 | Yes | The loaded model instance (used to retrieve layer shapes, module keys, and architecture config) |
Outputs
| Name | Type | Description |
|---|---|---|
| job["strategy"] | dict (side effect) | Maps layer keys (e.g., model.layers.0.self_attn, model.layers.0.mlp) to the chosen quantization parameter dict. Each entry includes accuracy, total_bits, and per-projection QParams dicts (q_proj, k_proj, etc.) |
Internal Constants
| Constant | Value | Description |
|---|---|---|
| norm_interval | (1.5, 3.5) | Range of error norm values explored in Stage 1 |
| norm_2ndstage | 0.15 | Width of the refined norm window in Stage 2 |
| anneal_temp_max | 2 | Starting temperature for simulated annealing |
| anneal_temp_min | 0.0001 | Stopping temperature |
| anneal_cooling_factor | 0.995 | Multiplicative cooling factor per iteration |
| anneal_iter | 1000 | Number of SA iterations per run |
| anneal_samples | 80 | Number of independent SA runs per stage |
| anneal_stages | 3 | Total number of optimization stages |
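To make the role of these constants concrete, the following is a minimal pure-Python sketch of a budget-constrained annealing run. It is not the exllamav2_ext implementation: the function name toy_sim_anneal, the proposal rule, the one-cooling-step-per-iteration loop, and the objective (sum of per-slot errors raised to the power norm) are all assumptions made for illustration.

```python
import math
import random

def toy_sim_anneal(slots, budget, temp_max, cooling, temp_min, iters, norm, seed=0):
    """Hypothetical stand-in for a budget-constrained annealing run."""
    rng = random.Random(seed)

    def bits(indices):
        return sum(slots[i][j][0] for i, j in enumerate(indices))

    def error(indices):
        # Assumed objective: sum of per-slot errors raised to `norm`.
        return sum(slots[i][j][1] ** norm for i, j in enumerate(indices))

    current = [0] * len(slots)  # start from the first option of every slot
    if bits(current) > budget:
        raise ValueError("budget too small for the starting assignment")
    best = list(current)
    temp = temp_max
    for _ in range(iters):
        if temp < temp_min:
            break
        # Propose switching one randomly chosen slot to a random option.
        i = rng.randrange(len(slots))
        candidate = list(current)
        candidate[i] = rng.randrange(len(slots[i]))
        if bits(candidate) <= budget:
            delta = error(candidate) - error(current)
            # Always accept improvements; accept regressions with
            # probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if error(current) < error(best):
                    best = list(current)
        temp *= cooling
    return best, bits(best), error(best)

# Toy slots: each option is (total_bits, error).
slots = [
    [(100, 0.10), (150, 0.05), (200, 0.02)],
    [(100, 0.20), (180, 0.08), (260, 0.03)],
    [(100, 0.15), (140, 0.07)],
]
best, used_bits, err = toy_sim_anneal(slots, 450, 2, 0.995, 0.0001, 1000, norm=2.0)
print(best, used_bits, err)
```

Proposals that would exceed the budget are rejected outright, so the returned assignment never uses more bits than allowed. Whether the C++ solver enforces the budget this way or through a penalty term is not stated here.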
Algorithm Walkthrough
Step 1: Compute Weight Budget
numel = sum(m.numel() for m in model.modules[first_q_layer : num_modules + first_q_layer])
weight_budget = int(numel * target_bpw)
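As a worked instance of the budget arithmetic, the sketch below replaces the model's modules with a plain list of weight counts. The layer sizes are illustrative and not taken from any particular model; target_bpw plays the role of job["bits"].

```python
# Toy stand-in for the quantizable modules: each entry is a weight count.
# These sizes are illustrative only.
layer_numels = [
    4096 * 4096 * 4,    # e.g. four square attention projections
    4096 * 11008 * 3,   # e.g. three MLP projections
]

target_bpw = 4.125  # same role as job["bits"]

numel = sum(layer_numels)
weight_budget = int(numel * target_bpw)

print(f"{numel} weights -> budget of {weight_budget} bits")
```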
Step 2: Compile Slot Options
Each layer becomes a "slot" with a list of (total_bits, error) options:
for opt in measurement_results:
    slot.append((int(opt["total_bits"]), 1 - opt["accuracy"]))
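The conversion can be tried on toy data as follows. The helper name compile_slot and the measurement values are hypothetical; only the total_bits and accuracy keys come from the snippet above. The options are listed in order of increasing total_bits, which is what the greedy pass in Step 4 appears to rely on when it treats the next index as an upgrade.

```python
def compile_slot(measurement_results):
    """Turn one layer's candidate list into (total_bits, error) pairs."""
    slot = []
    for opt in measurement_results:
        slot.append((int(opt["total_bits"]), 1 - opt["accuracy"]))
    return slot

# Made-up measurements for a single layer, cheapest option first.
toy_measurement = [
    {"total_bits": 2.0e6, "accuracy": 0.95},
    {"total_bits": 3.0e6, "accuracy": 0.98},
    {"total_bits": 4.0e6, "accuracy": 0.995},
]

slot = compile_slot(toy_measurement)
for total_bits, error in slot:
    print(total_bits, round(error, 6))
```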
Step 3: Three-Stage Simulated Annealing
# Stage 1: Broad norm sweep (80 samples, norm in [1.5, 3.5])
# Stage 2: Refined norm sweep (80 samples, norm in [bestnorm-0.075, bestnorm+0.075])
# Stage 3: Exploitation (80 samples, all at bestnorm)
s_, si_, p_, c_, m_ = ext_c.sim_anneal(
    slots, weight_budget,
    anneal_temp_max, anneal_cooling_factor,
    anneal_temp_min, anneal_iter, norm
)
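The norm values tried in each stage could be generated along these lines. This is a sketch under assumptions: the helper stage_norms is hypothetical, and even spacing across each window is a guess, since the comments above give only the window bounds and the sample count.

```python
norm_interval = (1.5, 3.5)
norm_2ndstage = 0.15
anneal_samples = 80

def stage_norms(stage, best_norm=None):
    """Return the list of norm values to try in a given stage (0, 1 or 2)."""
    if stage == 0:
        lo, hi = norm_interval                   # broad sweep
    elif stage == 1:
        half = norm_2ndstage / 2                 # window centred on best_norm
        lo, hi = best_norm - half, best_norm + half
    else:
        return [best_norm] * anneal_samples      # exploitation: fixed norm
    # Assumption: samples are evenly spaced across the window.
    step = (hi - lo) / (anneal_samples - 1)
    return [lo + k * step for k in range(anneal_samples)]

stage1 = stage_norms(0)
stage2 = stage_norms(1, best_norm=2.0)
stage3 = stage_norms(2, best_norm=2.0)
print(stage1[0], stage2[0], stage3[0])
```

Each value would then be passed as the norm argument of one annealing run, with the best-scoring norm carried into the next stage.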
Step 4: Greedy Budget Remainder Allocation
After SA converges, any leftover bit budget is spent by greedily upgrading layers:
while True:
    repeat = False
    for i in range(len(si)):
        if si[i] < len(slots[i]) - 1:
            delta_c = slots[i][si[i] + 1][0] - slots[i][si[i]][0]
            if c + delta_c <= weight_budget:
                c += delta_c
                si[i] = si[i] + 1
                repeat = True
    if not repeat:
        break
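The same loop can be run on toy data by wrapping it in a function. The wrapper name greedy_fill and the slot values are made up for illustration; the loop body follows the listing above.

```python
def greedy_fill(slots, si, c, weight_budget):
    """Upgrade slots one step at a time while the budget allows."""
    si = list(si)  # work on a copy of the selected option indices
    while True:
        repeat = False
        for i in range(len(si)):
            if si[i] < len(slots[i]) - 1:
                # Extra bits needed to move slot i to its next option.
                delta_c = slots[i][si[i] + 1][0] - slots[i][si[i]][0]
                if c + delta_c <= weight_budget:
                    c += delta_c
                    si[i] += 1
                    repeat = True
        if not repeat:
            break
    return si, c

# Each option is (total_bits, error), ordered by increasing total_bits.
slots = [
    [(100, 0.10), (150, 0.05), (200, 0.02)],
    [(100, 0.20), (180, 0.08)],
]
new_si, new_c = greedy_fill(slots, [0, 0], 200, 340)
print(new_si, new_c)  # [1, 1] 330
```

Starting from 200 bits, both slots are upgraded once (50 and 80 extra bits). The remaining 10 bits cannot pay for the next 50-bit step, so the loop stops at 330.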
Usage Examples
Basic Example
from exllamav2.conversion.optimize import optimize
# After measure_quant has populated job["measurement"]
job["bits"] = 4.125 # Target average bits-per-weight
optimize(job, save_fn, model)
# Result: job["strategy"] maps each layer to its optimal QParams
for layer_key, params in job["strategy"].items():
    # layer_numel is not set by optimize: the caller must supply the weight count of this layer
    bpw = params["total_bits"] / layer_numel
    print(f"{layer_key}: {bpw:.4f} bpw, accuracy: {params['accuracy']:.8f}")
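Because the per-layer weight count is not part of the strategy entries described above, a caller has to supply it. The sketch below shows one way to report per-layer and overall bits-per-weight. The helper summarize_strategy, the numel_by_layer mapping, and all numbers are hypothetical; the strategy dict only mimics the accuracy and total_bits fields.

```python
def summarize_strategy(strategy, numel_by_layer):
    """Return ({layer_key: bpw}, overall_bpw) for a strategy dict."""
    per_layer = {}
    total_bits = 0
    total_numel = 0
    for layer_key, params in strategy.items():
        numel = numel_by_layer[layer_key]
        per_layer[layer_key] = params["total_bits"] / numel
        total_bits += params["total_bits"]
        total_numel += numel
    return per_layer, total_bits / total_numel

# Made-up strategy and weight counts for a single transformer block.
toy_strategy = {
    "model.layers.0.self_attn": {"accuracy": 0.990, "total_bits": 4_000_000},
    "model.layers.0.mlp": {"accuracy": 0.985, "total_bits": 13_500_000},
}
toy_numel = {
    "model.layers.0.self_attn": 1_000_000,
    "model.layers.0.mlp": 3_000_000,
}

per_layer, overall = summarize_strategy(toy_strategy, toy_numel)
for key, bpw in per_layer.items():
    print(f"{key}: {bpw:.4f} bpw")
print(f"overall: {overall:.4f} bpw")
```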
Dependencies
- exllamav2_ext (C++ extension) -- provides the sim_anneal function for fast simulated annealing
- QParams -- quantization parameter dataclass for interpreting measurement results
- math, itertools, time -- standard library utilities