Principle: MIT HAN Lab LLM AWQ Per-Channel Scaling Search

From Leeroopedia

Overview

Per-channel scaling search is an optimization technique that finds per-channel scaling factors to minimize quantization error for groups of linear layers sharing the same input activation.

Description

Within each transformer block of a large language model, multiple linear layers often share a common input. For example, the query (Q), key (K), and value (V) projection layers all receive the same output from a preceding LayerNorm. Similarly, the gate and up projections in a feed-forward block share their input.

Per-channel scaling search exploits this structure by jointly optimizing the scaling for all linear layers in a group. The algorithm identifies these groups by tracing which linear layers connect to a common preceding operation (LayerNorm, another linear layer, or an activation function).

For each group, the search proceeds as follows:

  • Compute activation statistics: From the cached calibration activations, compute the per-channel maximum absolute value, x_max, across all samples for the shared input.
  • Grid search over scaling ratios: Iterate over candidate scaling ratios alpha from 0 (inclusive) to 1 (exclusive) in 20 evenly-spaced steps. For each alpha:
    • Compute the scaling vector: s = x_max^alpha
    • Apply the scaling to the preceding operation's output channels: prev_op.weight /= s
    • Absorb the inverse scaling into the linear layers' input channels: linear.weight *= s
    • Quantize the modified weights and measure the block-level output MSE against the unquantized block output
    • Restore the original weights
  • Select optimal alpha: Choose the alpha that produced the minimum MSE across all candidates.
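
The steps above can be sketched in PyTorch. This is an illustrative reconstruction, not the AWQ API: `search_group_scale`, `fake_quant`, and all parameter defaults are assumptions, and the error here is measured at the group level rather than over the full block for brevity.

```python
import torch

def fake_quant(w, n_bits=4, group_size=128):
    """Round-trip (quantize then dequantize) per-group asymmetric quantization."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(orig_shape)

@torch.no_grad()
def search_group_scale(x, linears, n_grid=20, n_bits=4, group_size=128):
    """Grid search for the per-channel scale of one group of linear layers
    that share the calibration input `x` (shape: [n_samples, in_features])."""
    # Per-channel maximum absolute activation across all calibration samples.
    x_max = x.abs().amax(dim=0).clamp(min=1e-4)

    # Reference output with unquantized weights.
    fp_out = torch.cat([lin(x) for lin in linears], dim=-1)
    originals = [lin.weight.data.clone() for lin in linears]

    best_err, best_scale = float("inf"), None
    for i in range(n_grid):                       # alpha in {0, 1/20, ..., 19/20}
        alpha = i / n_grid
        s = x_max.pow(alpha)
        s = s / (s.max() * s.min()).sqrt()        # normalize for stability
        for lin, w in zip(linears, originals):
            # Quantize the scaled weights, then fold 1/s back in; this is
            # equivalent to scaling the preceding op's output by 1/s.
            lin.weight.data = fake_quant(w * s, n_bits, group_size) / s
        err = (fp_out - torch.cat([lin(x) for lin in linears], dim=-1)).pow(2).mean()
        if err < best_err:
            best_err, best_scale = err.item(), s.clone()
    for lin, w in zip(linears, originals):        # restore the original weights
        lin.weight.data = w
    return best_scale
```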

The scaling is equivalent in full precision because dividing the preceding operation's output by s and multiplying the linear layer's weights by s cancel out. However, after quantization, different scalings redistribute quantization error differently across channels, and the optimal scaling concentrates precision on the channels that matter most.
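
The full-precision equivalence can be checked numerically. The module names here are illustrative stand-ins (a linear layer plays the role of the preceding operation):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)
prev = torch.nn.Linear(16, 16, bias=False)  # stands in for the preceding op
lin = torch.nn.Linear(16, 8, bias=False)    # a linear layer consuming its output

with torch.no_grad():
    before = lin(prev(x))
    s = torch.rand(16) + 0.5                # any positive per-channel scale
    prev.weight.data /= s[:, None]          # divide output channels by s
    lin.weight.data *= s                    # multiply input channels by s
    after = lin(prev(x))

# The two transformations cancel, so the output is unchanged (up to
# floating-point rounding).
print(torch.allclose(before, after, atol=1e-5))
```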

This approach is efficient because:

  • The grid search has only 20 candidates per group, making it fast
  • The search operates at the block level, capturing interactions between layers within the block
  • The scaling is absorbed into existing operations (e.g., LayerNorm weights or a preceding linear layer's weights and biases) rather than introducing new parameters

Usage

Per-channel scaling search is used as a sub-step of the AWQ quantization pipeline. It is called once per transformer block, for each group of linked linear layers within that block.

Typical groups in a LLaMA-style transformer block include:

  • Self-attention projections: LayerNorm -> Q, K, V projections
  • Attention output: V output -> output projection
  • Feed-forward gate/up: LayerNorm -> gate projection, up projection
  • Feed-forward down: Up activation -> down projection
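
As a concrete (hypothetical) sketch, the groups above could be expressed with Hugging Face LLaMA module names; the exact representation the orchestrator uses is an assumption here:

```python
# Illustrative group specification for one LLaMA-style transformer block.
# Each entry pairs the preceding operation with the linear layers that
# share its output as input.
scale_groups = [
    {"prev_op": "input_layernorm",
     "layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"]},
    {"prev_op": "self_attn.v_proj",
     "layers": ["self_attn.o_proj"]},
    {"prev_op": "post_attention_layernorm",
     "layers": ["mlp.gate_proj", "mlp.up_proj"]},
    {"prev_op": "mlp.up_proj",
     "layers": ["mlp.down_proj"]},
]
```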

The search is invoked automatically by the AWQ orchestrator (run_awq) when the auto_scale flag is set to True.

Theoretical Basis

The mathematical formulation is:

s = x_max^alpha / norm_factor

where x_max is the per-channel maximum activation magnitude and norm_factor ensures numerical stability.
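
One common realization of this formula clamps x_max away from zero and uses the geometric mean of the scale's extremes as the norm_factor; both choices are assumptions in this sketch:

```python
import torch

def compute_scale(x_max, alpha):
    """s = x_max^alpha / norm_factor, with a clamp and geometric-mean
    normalization as one plausible choice of norm_factor."""
    s = x_max.clamp(min=1e-4).pow(alpha)
    return s / (s.max() * s.min()).sqrt()   # keeps s centered around 1
```

With `x_max = [1, 2, 4]` and `alpha = 0.5`, the raw scales are `[1, sqrt(2), 2]` and the normalization maps them to `[1/sqrt(2), 1, sqrt(2)]`; at `alpha = 0` the scale degenerates to all ones, recovering the unscaled baseline.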

The scaling is applied as an equivalent transformation:

prev_op.weight /= s    (divide the preceding operation's output by s)
linear.weight  *= s    (multiply the target linear layer's input by s)

The grid search evaluates:

alpha in {0, 1/20, 2/20, 3/20, ..., 19/20}

For each candidate, the block-level quantization error is measured:

MSE = ||block(x) - block_quant(x)||^2

where block(x) is the original block output and block_quant(x) is the output after applying the candidate scaling and quantizing the weights.

The optimal scaling ratio balances two competing effects:

  • Higher alpha (closer to 1): Stronger scaling on high-activation channels, which reduces quantization error on salient weights but increases error on non-salient weights.
  • Lower alpha (closer to 0): Weaker scaling, approaching the unscaled (round-to-nearest) baseline.

The grid search finds the point where the net effect on block-level MSE is minimized.

Related Pages

Knowledge Sources

Domains

  • Quantization
  • Optimization
