Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Turboderp org Exllamav2 Bit Allocation Optimization

From Leeroopedia
Knowledge Sources
Domains Optimization, Quantization, Model_Compression
Last Updated 2026-02-15 00:00 GMT

Overview

Bit allocation optimization is the process of assigning quantization parameters to each layer of a neural network such that total model error is minimized subject to a global bitrate budget, using simulated annealing to explore the combinatorial solution space.

Description

After sensitivity measurement produces a per-layer error profile (a list of (total_bits, accuracy) options for each layer), the problem becomes: given a target average bits-per-weight (e.g., 4.125 bpw), choose one quantization configuration per layer such that:

  • The sum of bits across all layers does not exceed the total weight budget.
  • The total quantization error (aggregated across layers) is minimized.

This is a combinatorial optimization problem. With L layers and K options per layer, there are K^L possible assignments. Exhaustive search is infeasible for typical models with 60+ layers and 20+ options per layer. EXL2 solves this using simulated annealing, a probabilistic meta-heuristic inspired by the physical process of slowly cooling a material to reach a low-energy state.

The key insight of mixed-bitwidth quantization is that not all layers need the same precision. More sensitive layers (those with steep error curves) receive more bits, while robust layers are compressed more aggressively. This produces better overall quality than uniform quantization at the same average bitrate.

Usage

Bit allocation optimization is the third step in the EXL2 pipeline, executed after sensitivity measurement and before layer quantization. It typically completes in seconds to minutes, as it operates on the compact measurement data rather than model weights.

Theoretical Basis

Simulated Annealing

Simulated annealing (SA) is a probabilistic optimization algorithm that explores a solution space by:

  1. Starting with a random solution and a high temperature T.
  2. Proposing a neighbor solution by changing one layer's configuration.
  3. Accepting the neighbor if it improves the objective, or accepting it with probability exp(-delta_cost / T) if it worsens it.
  4. Cooling the temperature by a factor alpha after each iteration.
  5. Repeating until T falls below a minimum threshold.

The temperature schedule controls the exploration-exploitation trade-off: high temperatures allow uphill moves to escape local minima, while low temperatures refine the solution. The key SA parameters in EXL2 are:

Parameter Value Description
anneal_temp_max 2 Starting temperature
anneal_temp_min 0.0001 Stopping temperature
anneal_cooling_factor 0.995 Multiplicative cooling: T_new = T_old * 0.995
anneal_iter 1000 Iterations per annealing run
anneal_samples 80 Number of independent SA runs per stage
anneal_stages 3 Three-stage search strategy

Three-Stage Search

EXL2's optimizer runs simulated annealing in three stages:

  1. Stage 1 (broad sweep): Run 80 SA trials with the error norm parameter varying linearly across the range [1.5, 3.5]. This explores different trade-offs between minimizing the maximum per-layer error and minimizing the sum of errors.
  2. Stage 2 (refinement): Run 80 more SA trials with the norm parameter narrowed to a small window around the best norm found in Stage 1.
  3. Stage 3 (exploitation): Run 80 final SA trials all using the best norm value, collecting the lowest-cost solution.

Objective Function

The SA objective minimizes a norm-weighted aggregate of per-layer errors subject to the bit budget constraint:

minimize:  ( sum_l( error_l^p ) )^(1/p)
subject to: sum_l( bits_l ) <= weight_budget

where p is the error norm parameter. Higher p values emphasize minimizing the worst-case layer error (approaching minimax), while lower values focus on average error reduction.

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment