Principle: InternLM LMDeploy AWQ Weight Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model_Compression, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An activation-aware weight quantization algorithm that compresses model weights to 4-bit integers while preserving quality by protecting salient weight channels identified through activation analysis.
Description
AWQ (Activation-aware Weight Quantization) reduces LLM memory footprint by approximately 4x through 4-bit integer quantization of model weights (W4A16: 4-bit weights, 16-bit activations). The key insight is that not all weight channels are equally important: channels corresponding to large activation magnitudes have a disproportionate impact on output quality.
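A back-of-the-envelope calculation makes the "approximately 4x" figure concrete. This sketch uses illustrative numbers for a hypothetical 7B-parameter model (not measured values) and assumes each group of 128 weights stores one FP16 scale and one FP16 zero point:

```python
# Rough weight-memory estimate for a 7B-parameter model (illustrative).
# FP16 stores 2 bytes per weight; W4A16 stores 4 bits (0.5 bytes) per
# weight plus per-group scale/zero-point overhead.
params = 7e9
fp16_gb = params * 2 / 1024**3

group_size = 128  # weights per quantization group
# assume each group keeps one FP16 scale and one FP16 zero point (4 bytes)
int4_gb = (params * 0.5 + (params / group_size) * 4) / 1024**3

print(f"FP16: {fp16_gb:.1f} GiB, W4A16: {int4_gb:.1f} GiB, "
      f"ratio: {fp16_gb / int4_gb:.2f}x")
```

The per-group metadata is why the ratio lands slightly below the ideal 4x.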
The AWQ algorithm:
- Collects activation statistics from a calibration dataset
- Identifies salient weight channels based on activation magnitudes
- Applies per-group asymmetric quantization to the weights
- Searches for per-channel scaling factors that minimize the layer's quantization error
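The steps above can be sketched end to end on toy data. This is not LMDeploy's implementation: it uses synthetic "calibration" activations, a coarse per-output-channel quantizer, and a grid search over the scaling exponent alpha, mirroring the scale search described in the AWQ paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy calibration batch: a few input channels carry much larger activations
X = rng.normal(size=(256, 64))
X[:, :4] *= 10.0                      # "salient" channels
W = rng.normal(size=(64, 64))         # (in_channels, out_channels)

def quantize(w, bits=4):
    # per-output-channel symmetric round-to-nearest (sketch granularity only)
    scale = np.abs(w).max(axis=0, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# step 1: activation statistics from the calibration data
act_mag = np.abs(X).mean(axis=0)

# steps 2-4: s_c = |x_c|^alpha scales salient weight channels up (and the
# matching activations down) before quantization; alpha is grid-searched
best_err, best_alpha = np.inf, 0.0
for alpha in np.linspace(0.0, 1.0, 21):
    s = act_mag ** alpha
    s = s / np.sqrt(s.max() * s.min())        # normalization from the paper
    err = np.linalg.norm(X @ W - (X / s) @ quantize(W * s[:, None]))
    if err < best_err:
        best_err, best_alpha = err, alpha

plain_err = np.linalg.norm(X @ W - X @ quantize(W))
print(f"alpha={best_alpha:.2f}: err {best_err:.1f} vs plain {plain_err:.1f}")
```

Because alpha = 0 reduces to plain round-to-nearest, the searched result can never be worse than quantizing without activation-aware scaling.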
AWQ-quantized models are served using the TurboMind backend with optimized INT4 GEMM kernels.
Usage
Use AWQ when you need to reduce model memory by ~4x for deployment on limited GPU memory. Preferred over GPTQ for most use cases due to better accuracy preservation and faster quantization. Requires a calibration dataset (default: WikiText-2, 128 samples).
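A typical invocation looks like the following. The model path is an example, and the flag names reflect the lmdeploy CLI as documented; verify them against your installed version:

```shell
# Quantize with AWQ (calibration defaults shown explicitly)
lmdeploy lite auto_awq internlm/internlm2-chat-7b \
    --calib-dataset wikitext2 \
    --calib-samples 128 \
    --w-bits 4 \
    --w-group-size 128 \
    --work-dir ./internlm2-chat-7b-4bit

# Serve the quantized model with the TurboMind backend
lmdeploy serve api_server ./internlm2-chat-7b-4bit --model-format awq
```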
Theoretical Basis
AWQ identifies salient channels using activation magnitudes and protects them during quantization by scaling before rounding:

$s_c = \mathbb{E}\big[\,|x_c|\,\big]^{\alpha}, \qquad y = Q(w_c \cdot s_c)\,\frac{x_c}{s_c}$

where $x_c$ is the activation for channel $c$ and $w_c$ is the weight for channel $c$. Scaling high-saliency channels up before quantization gives them a finer effective quantization granularity.
The quantization formula per group:

$W_q = \mathrm{round}\!\left(\dfrac{W - \mathrm{zero\_point}}{\mathrm{scale}}\right)$
The group size is typically 128: each group of 128 consecutive weights shares one scale/zero-point pair.
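A minimal numpy sketch of this per-group asymmetric scheme (illustrative only; the zero point here is taken as the group minimum so the formula matches the one above):

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric round-to-nearest for one group of weights."""
    qmax = 2 ** bits - 1                      # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax
    zero_point = w.min()                      # zero point in the weight domain
    q = np.clip(np.round((w - zero_point) / scale), 0, qmax)
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return q * scale + zero_point

rng = np.random.default_rng(0)
w = rng.normal(size=1024)                     # one weight row
group_size = 128
recon = np.empty_like(w)
for i in range(0, w.size, group_size):
    g = w[i:i + group_size]                   # each group gets its own pair
    q, s, z = quantize_group(g)
    recon[i:i + group_size] = dequantize_group(q, s, z)

# worst-case error is half a quantization step of the widest group
max_err = np.abs(w - recon).max()
print(f"max reconstruction error: {max_err:.4f}")
```

Smaller groups shrink the quantization step inside each group at the cost of storing more scale/zero-point pairs.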