Principle: MIT HAN Lab LLM AWQ Per-Channel Scaling Search

From Leeroopedia

Overview

Per-channel scaling search is an optimization technique that finds per-channel scaling factors to minimize quantization error for groups of linear layers sharing the same input activation.

Description

Within each transformer block of a large language model, multiple linear layers often share a common input. For example, the query (Q), key (K), and value (V) projection layers all receive the same output from a preceding LayerNorm. Similarly, the gate and up projections in a feed-forward block share their input.

Per-channel scaling search exploits this structure by jointly optimizing the scaling for all linear layers in a group. The algorithm identifies these groups by tracing which linear layers connect to a common preceding operation (LayerNorm, another linear layer, or an activation function).

For each group, the search proceeds as follows:

  • Compute activation statistics: From the cached calibration activations, compute the per-channel maximum absolute value, x_max, across all samples for the shared input.
  • Grid search over scaling ratios: Iterate over candidate scaling ratios alpha from 0 (inclusive) to 1 (exclusive) in 20 evenly-spaced steps. For each alpha:
    • Compute the scaling vector: s = x_max^alpha
    • Apply the scaling to the preceding operation's output channels: prev_op.weight /= s
    • Absorb the inverse scaling into the linear layers' input channels: linear.weight *= s
    • Quantize the modified weights and measure the block-level output MSE against the unquantized block output
    • Restore the original weights
  • Select optimal alpha: Choose the alpha that produced the minimum MSE across all candidates.
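
The steps above can be sketched in PyTorch. This is an illustrative reconstruction, not the AWQ API: `search_group_scale`, `fake_quant`, and all parameter defaults are assumptions, and the error here is measured at the group level rather than over the full block for brevity.

```python
import torch

def fake_quant(w, n_bits=4, group_size=128):
    """Round-trip (quantize then dequantize) per-group asymmetric quantization."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(orig_shape)

@torch.no_grad()
def search_group_scale(x, linears, n_grid=20, n_bits=4, group_size=128):
    """Grid search for the per-channel scale of one group of linear layers
    that share the calibration input `x` (shape: [n_samples, in_features])."""
    # Per-channel maximum absolute activation across all calibration samples.
    x_max = x.abs().amax(dim=0).clamp(min=1e-4)

    # Reference output with unquantized weights.
    fp_out = torch.cat([lin(x) for lin in linears], dim=-1)
    originals = [lin.weight.data.clone() for lin in linears]

    best_err, best_scale = float("inf"), None
    for i in range(n_grid):                       # alpha in {0, 1/20, ..., 19/20}
        alpha = i / n_grid
        s = x_max.pow(alpha)
        s = s / (s.max() * s.min()).sqrt()        # normalize for stability
        for lin, w in zip(linears, originals):
            # Quantize the scaled weights, then fold 1/s back in; this is
            # equivalent to scaling the preceding op's output by 1/s.
            lin.weight.data = fake_quant(w * s, n_bits, group_size) / s
        err = (fp_out - torch.cat([lin(x) for lin in linears], dim=-1)).pow(2).mean()
        if err < best_err:
            best_err, best_scale = err.item(), s.clone()
    for lin, w in zip(linears, originals):        # restore the original weights
        lin.weight.data = w
    return best_scale
```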

The scaling is equivalent in full precision because dividing the preceding operation's output by s and multiplying the linear layer's weights by s cancel out. However, after quantization, different scalings redistribute quantization error differently across channels, and the optimal scaling concentrates precision on the channels that matter most.
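
The full-precision equivalence can be checked numerically. The module names here are illustrative stand-ins (a linear layer plays the role of the preceding operation):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)
prev = torch.nn.Linear(16, 16, bias=False)  # stands in for the preceding op
lin = torch.nn.Linear(16, 8, bias=False)    # a linear layer consuming its output

with torch.no_grad():
    before = lin(prev(x))
    s = torch.rand(16) + 0.5                # any positive per-channel scale
    prev.weight.data /= s[:, None]          # divide output channels by s
    lin.weight.data *= s                    # multiply input channels by s
    after = lin(prev(x))

# The two transformations cancel, so the output is unchanged (up to
# floating-point rounding).
print(torch.allclose(before, after, atol=1e-5))
```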

This approach is efficient because:

  • The grid search has only 20 candidates per group, making it fast
  • The search operates at the block level, capturing interactions between layers within the block
  • The scaling is absorbed into existing operations (e.g., LayerNorm weights or a preceding linear layer's weights and biases) rather than introducing new parameters

Usage

Per-channel scaling search is used as a sub-step of the AWQ quantization pipeline. It is called once per transformer block, for each group of linked linear layers within that block.

Typical groups in a LLaMA-style transformer block include:

  • Self-attention projections: LayerNorm -> Q, K, V projections
  • Attention output: V output -> output projection
  • Feed-forward gate/up: LayerNorm -> gate projection, up projection
  • Feed-forward down: Up activation -> down projection
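
As a concrete (hypothetical) sketch, the groups above could be expressed with Hugging Face LLaMA module names; the exact representation the orchestrator uses is an assumption here:

```python
# Illustrative group specification for one LLaMA-style transformer block.
# Each entry pairs the preceding operation with the linear layers that
# share its output as input.
scale_groups = [
    {"prev_op": "input_layernorm",
     "layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"]},
    {"prev_op": "self_attn.v_proj",
     "layers": ["self_attn.o_proj"]},
    {"prev_op": "post_attention_layernorm",
     "layers": ["mlp.gate_proj", "mlp.up_proj"]},
    {"prev_op": "mlp.up_proj",
     "layers": ["mlp.down_proj"]},
]
```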

The search is invoked automatically by the AWQ orchestrator (run_awq) when the auto_scale flag is set to True.

Theoretical Basis

The mathematical formulation is:

s = x_max^alpha / norm_factor

where x_max is the per-channel maximum activation magnitude and norm_factor ensures numerical stability.
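
One common realization of this formula clamps x_max away from zero and uses the geometric mean of the scale's extremes as the norm_factor; both choices are assumptions in this sketch:

```python
import torch

def compute_scale(x_max, alpha):
    """s = x_max^alpha / norm_factor, with a clamp and geometric-mean
    normalization as one plausible choice of norm_factor."""
    s = x_max.clamp(min=1e-4).pow(alpha)
    return s / (s.max() * s.min()).sqrt()   # keeps s centered around 1
```

With `x_max = [1, 2, 4]` and `alpha = 0.5`, the raw scales are `[1, sqrt(2), 2]` and the normalization maps them to `[1/sqrt(2), 1, sqrt(2)]`; at `alpha = 0` the scale degenerates to all ones, recovering the unscaled baseline.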

The scaling is applied as an equivalent transformation:

prev_op.weight /= s    (divide the preceding operation's output by s)
linear.weight  *= s    (multiply the target linear layer's input by s)

The grid search evaluates:

alpha in {0, 1/20, 2/20, 3/20, ..., 19/20}

For each candidate, the block-level quantization error is measured:

MSE = ||block(x) - block_quant(x)||^2

where block(x) is the original block output and block_quant(x) is the output after applying the candidate scaling and quantizing the weights.

The optimal scaling ratio balances two competing effects:

  • Higher alpha (closer to 1): Stronger scaling on high-activation channels, which reduces quantization error on salient weights but increases error on non-salient weights.
  • Lower alpha (closer to 0): Weaker scaling, approaching the unscaled (round-to-nearest) baseline.

The grid search finds the point where the net effect on block-level MSE is minimized.

Related Pages

Knowledge Sources

Domains

  • Quantization
  • Optimization
