Heuristic: AWQ Grid Search Tuning (mit-han-lab/llm-awq)
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Optimization |
| Last Updated | 2026-02-15 01:00 GMT |
Overview
Grid search parameters for AWQ calibration: 20-point grid for scaling/clipping, 512 calibration samples, 50% max shrink, and adaptive output-channel batching to prevent OOM.
Description
The AWQ quantization pipeline uses grid search at two critical stages: per-channel scaling (finding optimal activation-weight scale ratios) and weight clipping (finding optimal clipping thresholds). Both use a 20-point grid (`n_grid=20`) to balance search thoroughness with computation cost. The scaling search evaluates the 20 ratios 0, 1/20, ..., 19/20 (controlling the interpolation between activation-based and weight-based scaling), while the clipping search only evaluates the first 50% of the grid (`max_shrink=0.5`) to avoid excessive weight magnitude reduction. Calibration uses 512 samples of 512-token sequences by default.
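The two grids described above are small enough to write out directly. This sketch (illustrative only, not library code) enumerates the scaling ratios and the clipping shrink factors implied by `n_grid=20` and `max_shrink=0.5`:

```python
n_grid = 20
max_shrink = 0.5

# Scaling search: 20 ratios, 0.0 to 0.95 in steps of 1/20.
scale_ratios = [i / n_grid for i in range(n_grid)]

# Clipping search: only the first 50% of the grid is evaluated,
# shrinking the original max value by at most max_shrink.
shrink_factors = [1 - i / n_grid for i in range(int(max_shrink * n_grid))]
```

Note that the clipping search stops at a shrink factor of 0.55, so weights are never clipped below 50% of their original range.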
Usage
Apply these parameters when running AWQ quantization via `run_awq()`. These defaults work well for most LLM architectures. Consider adjusting `n_samples` (128 for faster experimentation via CLI, 512 for thorough calibration via library) and `oc_batch_size` if encountering OOM during the clipping search.
The Insight (Rule of Thumb)
- Grid Resolution: `n_grid=20` provides sufficient granularity for both scaling and clipping search. Finer grids (40, 80) yield diminishing returns.
- Max Shrink: `max_shrink=0.5` limits weight clipping to at most 50% of original magnitude. This avoids excessive information loss while still finding useful clipping thresholds.
- Calibration Size: `n_samples=512, seqlen=512` (262K total tokens) is the library default. The CLI uses `n_samples=128` for faster iteration.
- OC Batch Size: `oc_batch_size=256` for output channels divisible by 256; falls back to 64 otherwise. This prevents OOM during the per-group clipping search.
- Scale Normalization: Scales are normalized by the geometric mean (`scales / sqrt(max * min)`) to prevent extreme scaling factors.
- Numerical Floor: `clamp(min=1e-4)` in scaling and `clamp(min=1e-5)` in quantization prevent division-by-zero in scale computation.
- Trade-off: The grid search adds ~10-30 minutes to quantization time per model but significantly improves quantized model quality.
Reasoning
The 20-point grid is an empirical choice from the AWQ paper. The scaling search explores the ratio `r` in `s = x_max^r` where `r in [0, 1/20, 2/20, ..., 19/20]`. At `r=0`, scales are uniform (no protection); at `r=1`, scales fully protect salient channels. The search picks the ratio minimizing MSE between full-precision and quantized block outputs. The `max_shrink=0.5` bound on clipping ensures weights are never reduced by more than half, which empirically preserves model quality while still enabling tighter quantization ranges. The adaptive batch size (256 vs 64) is a memory-aware heuristic: processing 256 output channels simultaneously is faster but requires more VRAM; falling back to 64 prevents OOM on smaller GPUs.
```python
# From awq/quantize/auto_scale.py:124-131
n_grid = 20
org_sd = {k: v.cpu() for k, v in block.state_dict().items()}
for ratio in range(n_grid):
    ratio = ratio * 1 / n_grid
    scales = x_max.pow(ratio).clamp(min=1e-4).view(-1)
    scales = scales / (scales.max() * scales.min()).sqrt()
```
```python
# From awq/quantize/auto_clip.py:12,26,41
def auto_clip_layer(w, input_feat, n_bit, q_config,
                    n_grid=20, max_shrink=0.5, n_sample_token=512):
    oc_batch_size = 256 if w.shape[0] % 256 == 0 else 64  # prevent OOM
    # ...
    for i_s in range(int(max_shrink * n_grid)):
        max_val = org_max_val * (1 - i_s / n_grid)
```
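The fragments above can be stitched into a self-contained toy version of the scaling search. Everything here is illustrative (the fake quantizer, the toy data, the function names are assumptions, not the library's API); it only demonstrates the selection logic: build `s = x_max^r` for each grid ratio and keep the ratio whose quantized output has the lowest MSE against the full-precision output.

```python
import math

def fake_quant(w, step=0.25):
    # Crude uniform quantizer standing in for real low-bit quantization.
    return round(w / step) * step

def search_scale_ratio(x_max, weights, acts, n_grid=20):
    # Toy grid search: try r in {0, 1/20, ..., 19/20}, build per-channel
    # scales s = x_max^r (clamped, geometric-mean normalized), and keep
    # the ratio minimizing MSE vs. the full-precision output.
    best_ratio, best_err = 0.0, float("inf")
    for i in range(n_grid):
        r = i / n_grid
        scales = [max(x ** r, 1e-4) for x in x_max]
        g = math.sqrt(max(scales) * min(scales))
        scales = [s / g for s in scales]
        err = 0.0
        for row in weights:
            ref = sum(w * a for w, a in zip(row, acts))
            # Scale weights up, activations down, then fake-quantize:
            approx = sum(fake_quant(w * s) * (a / s)
                         for w, s, a in zip(row, scales, acts))
            err += (ref - approx) ** 2
        if err < best_err:
            best_ratio, best_err = r, err
    return best_ratio
```

At `r=0` all scales are 1 (no protection of salient channels); as `r` grows, channels with large activations get proportionally larger weight scales, reducing their relative quantization error.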