Principle: Turboderp-org ExLlamaV2 Bit Allocation Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization, Model_Compression |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Bit allocation optimization is the process of assigning quantization parameters to each layer of a neural network such that total model error is minimized subject to a global bitrate budget, using simulated annealing to explore the combinatorial solution space.
Description
After sensitivity measurement produces a per-layer error profile (a list of (total_bits, accuracy) options for each layer), the problem becomes: given a target average bits-per-weight (e.g., 4.125 bpw), choose one quantization configuration per layer such that:
- The sum of bits across all layers does not exceed the total weight budget.
- The total quantization error (aggregated across layers) is minimized.
This is a combinatorial optimization problem. With L layers and K options per layer, there are K^L possible assignments. Exhaustive search is infeasible for typical models with 60+ layers and 20+ options per layer. EXL2 solves this using simulated annealing, a probabilistic meta-heuristic inspired by the physical process of slowly cooling a material to reach a low-energy state.
The key insight of mixed-bitwidth quantization is that not all layers need the same precision. More sensitive layers (those with steep error curves) receive more bits, while robust layers are compressed more aggressively. This produces better overall quality than uniform quantization at the same average bitrate.
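The budget-constrained selection problem can be made concrete with a toy example. The data layout and numbers below are illustrative, not EXL2's actual measurement format: each layer has a list of `(total_bits, quantization_error)` options, and an assignment picks one option per layer.

```python
# Hypothetical per-layer measurement data: each inner list holds
# (total_bits, quantization_error) options for one layer.
# Numbers are illustrative only.
options = [
    [(1024, 0.09), (2048, 0.03), (4096, 0.01)],   # layer 0: robust
    [(1024, 0.40), (2048, 0.12), (4096, 0.02)],   # layer 1: sensitive
]

def feasible(choice, budget):
    """Bit-budget constraint: sum of chosen bits <= budget."""
    return sum(options[l][c][0] for l, c in enumerate(choice)) <= budget

def total_error(choice):
    """Aggregate error of a candidate assignment (simple sum here)."""
    return sum(options[l][c][1] for l, c in enumerate(choice))

budget = 5120
uniform = [1, 1]   # 2048 + 2048 bits, error 0.03 + 0.12 = 0.15
mixed = [0, 2]     # 1024 + 4096 bits, error 0.09 + 0.02 = 0.11
assert feasible(uniform, budget) and feasible(mixed, budget)
assert total_error(mixed) < total_error(uniform)
```

Within the same budget, shifting bits from the robust layer to the sensitive one lowers total error, which is the effect the optimizer searches for at scale.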
Usage
Bit allocation optimization is the third step in the EXL2 pipeline, executed after sensitivity measurement and before layer quantization. It typically completes in seconds to minutes, as it operates on the compact measurement data rather than model weights.
Theoretical Basis
Simulated Annealing
Simulated annealing (SA) is a probabilistic optimization algorithm that explores a solution space by:
- Starting with a random solution and a high temperature T.
- Proposing a neighbor solution by changing one layer's configuration.
- Accepting the neighbor if it improves the objective, or accepting it with probability exp(-delta_cost / T) if it worsens it.
- Cooling the temperature by a factor alpha after each iteration.
- Repeating until T falls below a minimum threshold.
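The steps above can be sketched as a generic SA loop. This is a minimal illustration, not EXL2's implementation; the default parameter values mirror the table below, and `cost` and `neighbor` are caller-supplied functions.

```python
import math
import random

def simulated_anneal(cost, neighbor, init,
                     t_max=2.0, t_min=1e-4, cooling=0.995):
    """Generic simulated-annealing loop (illustrative sketch)."""
    current, current_cost = init, cost(init)
    best, best_cost = current, current_cost
    t = t_max
    while t > t_min:
        cand = neighbor(current)           # e.g. change one layer's config
        delta = cost(cand) - current_cost
        # Accept improvements; accept worse moves with prob exp(-delta / t).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            current, current_cost = cand, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling                       # multiplicative cooling schedule
    return best, best_cost
```

As a smoke test, minimizing a toy cost such as `lambda x: (x - 3) ** 2` with `neighbor=lambda x: x + random.choice([-1, 1])` converges on the minimum at 3 long before the temperature bottoms out.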
The temperature schedule controls the exploration-exploitation trade-off: high temperatures allow uphill moves to escape local minima, while low temperatures refine the solution. The key SA parameters in EXL2 are:
| Parameter | Value | Description |
|---|---|---|
| anneal_temp_max | 2 | Starting temperature |
| anneal_temp_min | 0.0001 | Stopping temperature |
| anneal_cooling_factor | 0.995 | Multiplicative cooling: T_new = T_old * 0.995 |
| anneal_iter | 1000 | Iterations per annealing run |
| anneal_samples | 80 | Number of independent SA runs per stage |
| anneal_stages | 3 | Three-stage search strategy |
Three-Stage Search
EXL2's optimizer runs simulated annealing in three stages:
- Stage 1 (broad sweep): Run 80 SA trials with the error norm parameter varying linearly across the range [1.5, 3.5]. This explores different trade-offs between minimizing the maximum per-layer error and minimizing the sum of errors.
- Stage 2 (refinement): Run 80 more SA trials with the norm parameter narrowed to a small window around the best norm found in Stage 1.
- Stage 3 (exploitation): Run 80 final SA trials all using the best norm value, collecting the lowest-cost solution.
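The three stages can be sketched as a wrapper around a single-trial routine. This is an illustrative outline, not EXL2's code: `anneal(norm)` is assumed to run one SA trial at error norm `norm` and return a `(cost, solution)` pair, and the ±0.1 refinement window is an assumed value.

```python
import random

def three_stage_search(anneal, samples=80, window=0.1):
    """Sketch of the three-stage norm search (assumed interface)."""
    # Stage 1 (broad sweep): norm varies linearly across [1.5, 3.5].
    norms = [1.5 + i * 2.0 / (samples - 1) for i in range(samples)]
    trials = [(anneal(p)[0], p) for p in norms]
    best_norm = min(trials)[1]
    # Stage 2 (refinement): small window around the best Stage-1 norm.
    norms = [best_norm + random.uniform(-window, window)
             for _ in range(samples)]
    trials = [(anneal(p)[0], p) for p in norms]
    best_norm = min(trials)[1]
    # Stage 3 (exploitation): all trials at the best norm; keep the
    # lowest-cost solution found.
    results = [anneal(best_norm) for _ in range(samples)]
    return min(results, key=lambda r: r[0])
```

The staging amortizes the search: broad coverage first, then cheap local refinement, then repeated exploitation of the single most promising norm.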
Objective Function
The SA objective minimizes a norm-weighted aggregate of per-layer errors subject to the bit budget constraint:
minimize: ( sum_l( error_l^p ) )^(1/p)
subject to: sum_l( bits_l ) <= weight_budget
where p is the error norm parameter. Higher p values emphasize minimizing the worst-case layer error (approaching minimax), while lower values focus on average error reduction.
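The effect of the norm parameter p can be checked numerically with a direct transcription of the objective (the layer errors below are made-up values):

```python
def objective(errors, p):
    """Norm-weighted aggregate of per-layer errors: (sum_l error_l^p)^(1/p)."""
    return sum(e ** p for e in errors) ** (1.0 / p)

errors = [0.1, 0.1, 0.4]   # illustrative per-layer errors
# p = 1 reduces to the plain sum of errors (average-error focus):
print(objective(errors, 1.0))     # ~0.6
# Large p is dominated by the worst layer (approaching minimax):
print(objective(errors, 100.0))   # ~0.4, the maximum per-layer error
```

Raising p therefore trades a slightly worse average for a tighter bound on the single worst layer, which is exactly the trade-off the Stage 1 norm sweep explores.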