Principle: Turboderp-org ExLlamaV2 Bit Allocation Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization, Model_Compression |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Bit allocation optimization is the process of assigning quantization parameters to each layer of a neural network such that total model error is minimized subject to a global bitrate budget, using simulated annealing to explore the combinatorial solution space.
Description
After sensitivity measurement produces a per-layer error profile (a list of (total_bits, accuracy) options for each layer), the problem becomes: given a target average bits-per-weight (e.g., 4.125 bpw), choose one quantization configuration per layer such that:
- The sum of bits across all layers does not exceed the total weight budget.
- The total quantization error (aggregated across layers) is minimized.
This is a combinatorial optimization problem. With L layers and K options per layer, there are K^L possible assignments. Exhaustive search is infeasible for typical models with 60+ layers and 20+ options per layer. EXL2 solves this using simulated annealing, a probabilistic meta-heuristic inspired by the physical process of slowly cooling a material to reach a low-energy state.
The key insight of mixed-bitwidth quantization is that not all layers need the same precision. More sensitive layers (those with steep error curves) receive more bits, while robust layers are compressed more aggressively. This produces better overall quality than uniform quantization at the same average bitrate.
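The budget-constrained selection problem can be made concrete with a toy example. The data layout and numbers below are illustrative, not EXL2's actual measurement format: each layer has a list of `(total_bits, quantization_error)` options, and an assignment picks one option per layer.

```python
# Hypothetical per-layer measurement data: each inner list holds
# (total_bits, quantization_error) options for one layer.
# Numbers are illustrative only.
options = [
    [(1024, 0.09), (2048, 0.03), (4096, 0.01)],   # layer 0: robust
    [(1024, 0.40), (2048, 0.12), (4096, 0.02)],   # layer 1: sensitive
]

def feasible(choice, budget):
    """Bit-budget constraint: sum of chosen bits <= budget."""
    return sum(options[l][c][0] for l, c in enumerate(choice)) <= budget

def total_error(choice):
    """Aggregate error of a candidate assignment (simple sum here)."""
    return sum(options[l][c][1] for l, c in enumerate(choice))

budget = 5120
uniform = [1, 1]   # 2048 + 2048 bits, error 0.03 + 0.12 = 0.15
mixed = [0, 2]     # 1024 + 4096 bits, error 0.09 + 0.02 = 0.11
assert feasible(uniform, budget) and feasible(mixed, budget)
assert total_error(mixed) < total_error(uniform)
```

Within the same budget, shifting bits from the robust layer to the sensitive one lowers total error, which is the effect the optimizer searches for at scale.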
Usage
Bit allocation optimization is the third step in the EXL2 pipeline, executed after sensitivity measurement and before layer quantization. It typically completes in seconds to minutes, as it operates on the compact measurement data rather than model weights.
Theoretical Basis
Simulated Annealing
Simulated annealing (SA) is a probabilistic optimization algorithm that explores a solution space by:
- Starting with a random solution and a high temperature T.
- Proposing a neighbor solution by changing one layer's configuration.
- Accepting the neighbor if it improves the objective, or accepting it with probability exp(-delta_cost / T) if it worsens it.
- Cooling the temperature by a factor alpha after each iteration.
- Repeating until T falls below a minimum threshold.
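The steps above can be sketched as a generic SA loop. This is a minimal illustration, not EXL2's implementation; the default parameter values mirror the table below, and `cost` and `neighbor` are caller-supplied functions.

```python
import math
import random

def simulated_anneal(cost, neighbor, init,
                     t_max=2.0, t_min=1e-4, cooling=0.995):
    """Generic simulated-annealing loop (illustrative sketch)."""
    current, current_cost = init, cost(init)
    best, best_cost = current, current_cost
    t = t_max
    while t > t_min:
        cand = neighbor(current)           # e.g. change one layer's config
        delta = cost(cand) - current_cost
        # Accept improvements; accept worse moves with prob exp(-delta / t).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            current, current_cost = cand, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling                       # multiplicative cooling schedule
    return best, best_cost
```

As a smoke test, minimizing a toy cost such as `lambda x: (x - 3) ** 2` with `neighbor=lambda x: x + random.choice([-1, 1])` converges on the minimum at 3 long before the temperature bottoms out.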
The temperature schedule controls the exploration-exploitation trade-off: high temperatures allow uphill moves to escape local minima, while low temperatures refine the solution. The key SA parameters in EXL2 are:
| Parameter | Value | Description |
|---|---|---|
| anneal_temp_max | 2 | Starting temperature |
| anneal_temp_min | 0.0001 | Stopping temperature |
| anneal_cooling_factor | 0.995 | Multiplicative cooling: T_new = T_old * 0.995 |
| anneal_iter | 1000 | Iterations per annealing run |
| anneal_samples | 80 | Number of independent SA runs per stage |
| anneal_stages | 3 | Three-stage search strategy |
Three-Stage Search
EXL2's optimizer runs simulated annealing in three stages:
- Stage 1 (broad sweep): Run 80 SA trials with the error norm parameter varying linearly across the range [1.5, 3.5]. This explores different trade-offs between minimizing the maximum per-layer error and minimizing the sum of errors.
- Stage 2 (refinement): Run 80 more SA trials with the norm parameter narrowed to a small window around the best norm found in Stage 1.
- Stage 3 (exploitation): Run 80 final SA trials all using the best norm value, collecting the lowest-cost solution.
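The three stages can be sketched as a wrapper around a single-trial routine. This is an illustrative outline, not EXL2's code: `anneal(norm)` is assumed to run one SA trial at error norm `norm` and return a `(cost, solution)` pair, and the ±0.1 refinement window is an assumed value.

```python
import random

def three_stage_search(anneal, samples=80, window=0.1):
    """Sketch of the three-stage norm search (assumed interface)."""
    # Stage 1 (broad sweep): norm varies linearly across [1.5, 3.5].
    norms = [1.5 + i * 2.0 / (samples - 1) for i in range(samples)]
    trials = [(anneal(p)[0], p) for p in norms]
    best_norm = min(trials)[1]
    # Stage 2 (refinement): small window around the best Stage-1 norm.
    norms = [best_norm + random.uniform(-window, window)
             for _ in range(samples)]
    trials = [(anneal(p)[0], p) for p in norms]
    best_norm = min(trials)[1]
    # Stage 3 (exploitation): all trials at the best norm; keep the
    # lowest-cost solution found.
    results = [anneal(best_norm) for _ in range(samples)]
    return min(results, key=lambda r: r[0])
```

The staging amortizes the search: broad coverage first, then cheap local refinement, then repeated exploitation of the single most promising norm.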
Objective Function
The SA objective minimizes a norm-weighted aggregate of per-layer errors subject to the bit budget constraint:
minimize: ( sum_l( error_l^p ) )^(1/p)
subject to: sum_l( bits_l ) <= weight_budget
where p is the error norm parameter. Higher p values emphasize minimizing the worst-case layer error (approaching minimax), while lower values focus on average error reduction.
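The effect of the norm parameter p can be checked numerically with a direct transcription of the objective (the layer errors below are made-up values):

```python
def objective(errors, p):
    """Norm-weighted aggregate of per-layer errors: (sum_l error_l^p)^(1/p)."""
    return sum(e ** p for e in errors) ** (1.0 / p)

errors = [0.1, 0.1, 0.4]   # illustrative per-layer errors
# p = 1 reduces to the plain sum of errors (average-error focus):
print(objective(errors, 1.0))     # ~0.6
# Large p is dominated by the worst layer (approaching minimax):
print(objective(errors, 100.0))   # ~0.4, the maximum per-layer error
```

Raising p therefore trades a slightly worse average for a tighter bound on the single worst layer, which is exactly the trade-off the Stage 1 norm sweep explores.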