
Principle:InternLM Lmdeploy AWQ Weight Quantization

From Leeroopedia


Knowledge Sources
Domains Model_Compression, Quantization
Last Updated 2026-02-07 15:00 GMT

Overview

An activation-aware weight quantization algorithm that compresses model weights to 4-bit integers while preserving quality by protecting salient weight channels identified through activation analysis.

Description

AWQ (Activation-aware Weight Quantization) reduces LLM memory footprint by approximately 4x through 4-bit integer quantization of model weights (W4A16: 4-bit weights, 16-bit activations). The key insight is that not all weight channels are equally important: channels corresponding to large activation magnitudes have a disproportionate impact on output quality.

The AWQ algorithm:

  1. Collects activation statistics from a calibration dataset
  2. Identifies salient weight channels based on activation magnitudes
  3. Applies per-group asymmetric quantization with scale search
  4. Optionally searches for optimal scaling factors to minimize quantization error
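The four steps above can be sketched end-to-end on toy data. The following is an illustrative pure-Python mock, not LMDeploy's implementation: every name here (`calib_batches`, `quantize_rtn`, the `alpha` grid) is invented for the example. It follows AWQ's core idea of scaling channels by $\mathbb{E}[|X_c|]^{\alpha}$ before round-to-nearest quantization and grid-searching $\alpha$ to minimize quantization error.

```python
import random
import statistics

random.seed(0)

# Toy stand-in for one linear layer: 4 input channels, 8 output rows.
n_ch = 4
weights = [[random.uniform(-1, 1) for _ in range(n_ch)] for _ in range(8)]
# Hypothetical calibration activations; channel 1 has much larger magnitude.
calib_batches = [[random.gauss(0, s) for s in (0.1, 3.0, 0.2, 0.5)]
                 for _ in range(256)]

# Steps 1-2: collect per-channel activation statistics, rank saliency.
act_mag = [statistics.mean(abs(x[c]) for x in calib_batches) for c in range(n_ch)]
salient = max(range(n_ch), key=lambda c: act_mag[c])  # channel with largest E[|X_c|]

# Step 3: plain asymmetric round-to-nearest quantization of a weight group.
def quantize_rtn(ws, bits=4):
    qmax = 2 ** bits - 1
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / qmax or 1e-8
    return [round((w - lo) / scale) * scale + lo for w in ws]

# Step 4: grid-search alpha for scales s_c = E[|X_c|]**alpha. Weights are
# divided by s_c at inference (activations multiplied by s_c), so scaling
# changes quantization error but not the layer's output.
def quant_error(alpha):
    err = 0.0
    for row in weights:
        scaled = [w * act_mag[c] ** alpha for c, w in enumerate(row)]
        deq = quantize_rtn(scaled)
        err += sum((d / act_mag[c] ** alpha - w) ** 2
                   for c, (w, d) in enumerate(zip(row, deq)))
    return err

best_alpha = min((a / 10 for a in range(11)), key=quant_error)
print(f"salient channel: {salient}, best alpha: {best_alpha:.1f}")
```

The search over a small `alpha` grid mirrors AWQ's scale search: `alpha = 0` disables protection entirely, while larger values shift quantization precision toward high-activation channels.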

AWQ-quantized models are served using the TurboMind backend with optimized INT4 GEMM kernels.

Usage

Use AWQ when you need to reduce model weight memory by roughly 4x for deployment on GPUs with limited memory. It is generally preferred over GPTQ because it preserves accuracy better and quantizes faster. AWQ requires a calibration dataset (default: WikiText-2, 128 samples).

Theoretical Basis

AWQ identifies salient channels using activation magnitudes and protects them during quantization:

$\text{saliency}(c) = \mathbb{E}\left[|X_c|\right] \cdot |W_c|$

where $X_c$ is the activation for channel $c$, $W_c$ is the corresponding weight, and the expectation is taken over the calibration set. High-saliency channels are protected by scaling them up before quantization, which reduces their relative quantization error.
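A small worked example of the saliency score, using invented channel names, activation samples, and weights (none of these numbers come from a real model):

```python
import statistics

# Hypothetical calibration activations for three channels.
activations = {
    "c0": [0.1, -0.2, 0.15, -0.05],
    "c1": [4.0, -3.5, 5.1, -4.4],   # large-magnitude activations
    "c2": [0.9, -1.1, 0.8, -1.0],
}
weights = {"c0": 0.7, "c1": 0.05, "c2": 0.4}

# saliency(c) = E[|X_c|] * |W_c|
saliency = {c: statistics.mean(abs(a) for a in acts) * abs(weights[c])
            for c, acts in activations.items()}
ranked = sorted(saliency, key=saliency.get, reverse=True)
print(ranked)  # → ['c2', 'c1', 'c0']
```

Note that `c1` has by far the largest activations but a tiny weight, so `c2` ends up most salient: both factors in the product matter.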

The quantization formula per group: $W_q = \text{round}\left(\dfrac{W - \text{zero\_point}}{\text{scale}}\right)$

With group size typically 128 (each group of 128 weights shares a scale/zero-point pair).
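A minimal sketch of per-group asymmetric quantization with group size 128, where the zero point is expressed in weight units so that $W_q = \text{round}((W - \text{zero\_point})/\text{scale})$. This is illustrative code on random toy weights, not the TurboMind INT4 kernel:

```python
import random

random.seed(1)
GROUP_SIZE = 128

def quantize_group(group, bits=4):
    """One (scale, zero_point) pair shared by the whole group."""
    qmax = 2 ** bits - 1              # 15 for 4-bit
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax or 1.0
    zero_point = lo                   # asymmetric: map lo -> 0, hi -> qmax
    q = [max(0, min(qmax, round((w - zero_point) / scale))) for w in group]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return [qi * scale + zero_point for qi in q]

weights = [random.uniform(-0.5, 0.5) for _ in range(512)]  # toy weight row
recon = []
for start in range(0, len(weights), GROUP_SIZE):
    group = weights[start:start + GROUP_SIZE]
    q, s, z = quantize_group(group)
    recon.extend(dequantize_group(q, s, z))

max_err = max(abs(w - r) for w, r in zip(weights, recon))
print(f"groups: {len(weights) // GROUP_SIZE}, max abs error: {max_err:.4f}")
```

Because each group of 128 weights shares one scale/zero-point pair, the worst-case round-trip error per weight is half the group's scale, i.e. half of the group's value range divided by 15.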

Related Pages

Implemented By
