Principle:Mit han lab Llm awq Quantized Linear Module

Overview

A custom neural network module that replaces standard linear layers, storing weights as INT4 and dequantizing them with hardware-accelerated CUDA kernels during inference.

Description

The quantized linear module stores weights in a packed INT4 format (four weights per INT16 element) together with per-group FP16 scales and scaled_zeros (zero points folded together with their group scales). During the forward pass it selects between two CUDA kernels: GEMV when the batch size is below 8 (e.g., autoregressive decoding) and GEMM otherwise (e.g., prompt prefilling). The interleaved packing layout and split-k configuration are tuned for GPU memory access patterns. The from_linear() classmethod converts a standard nn.Linear layer into this module.
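
The packing ratio described above (four 4-bit weights per 16-bit element) can be illustrated with a minimal pure-Python sketch. The helper names pack_int4 and unpack_int4 are hypothetical, and the simple low-nibble-first ordering used here is an assumption for illustration only; it is not the interleaved layout the CUDA kernels expect.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) into 16-bit words, four per word.

    Assumption: slot 0 occupies the lowest nibble. The real module uses an
    interleaved layout that this sketch does not reproduce.
    """
    if len(values) % 4 != 0:
        raise ValueError("number of values must be a multiple of 4")
    words = []
    for start in range(0, len(values), 4):
        word = 0
        for slot, value in enumerate(values[start:start + 4]):
            if not 0 <= value <= 15:
                raise ValueError("each value must fit in 4 bits")
            word |= value << (4 * slot)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover four 4-bit values from each 16-bit word."""
    values = []
    for word in words:
        for slot in range(4):
            values.append((word >> (4 * slot)) & 0xF)
    return values


weights = [1, 2, 3, 4, 15, 0, 7, 8]
packed = pack_int4(weights)
restored = unpack_int4(packed)
```

Eight 4-bit weights occupy two 16-bit words here, which is the 4x storage reduction relative to FP16 that the module relies on.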

Usage

Instances are created automatically by real_quantize_model_weight() or load_awq_llama_fast(), and are used for all quantized inference.

Theoretical Basis

W4A16 compute:

y = dequant(W_int4) @ x_fp16

Dequantization, applied per group with that group's scale and zero point:

w_fp16 = (w_int4 - zero) * scale
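
The two formulas above can be sketched in pure Python as a reference, not as the CUDA implementation. All function names below are hypothetical. The sketch also shows why a pre-scaled zero term is convenient: since (q - zero) * scale = q * scale - zero * scale, storing scaled_zero = -zero * scale turns dequantization into a single multiply-add. The sign convention for scaled_zero is an assumption made for this example.

```python
def dequantize_group(q_values, scale, zero):
    """Reference dequantization: w = (q - zero) * scale for one group."""
    return [(q - zero) * scale for q in q_values]


def dequantize_group_scaled_zero(q_values, scale, scaled_zero):
    """Equivalent form with a precomputed term (assumed: scaled_zero = -zero * scale)."""
    return [q * scale + scaled_zero for q in q_values]


def w4a16_matvec(q_rows, scales, zeros, x, group_size):
    """Compute y = dequant(W) @ x for one activation vector x.

    q_rows[r] holds the 4-bit codes of output row r; scales[r][g] and
    zeros[r][g] hold the parameters of group g within that row.
    """
    y = []
    for r, row in enumerate(q_rows):
        acc = 0.0
        for start in range(0, len(row), group_size):
            g = start // group_size
            w = dequantize_group(row[start:start + group_size], scales[r][g], zeros[r][g])
            acc += sum(wi * xi for wi, xi in zip(w, x[start:start + group_size]))
        y.append(acc)
    return y


direct = dequantize_group([0, 8, 15], scale=0.5, zero=8)
fused = dequantize_group_scaled_zero([0, 8, 15], scale=0.5, scaled_zero=-8 * 0.5)
y = w4a16_matvec([[0, 8, 15, 8]], [[0.5, 1.0]], [[8, 8]], [1.0, 1.0, 2.0, 1.0], group_size=2)
```

In the actual kernels the dequantization is fused into the matrix multiply, so the full FP16 weight matrix is never materialized in memory.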

GEMV vs GEMM selection is based on a batch size threshold of 8:

GEMV is used when batch size < 8 (e.g., single-token autoregressive decoding)
GEMM is used when batch size >= 8 (e.g., prompt prefilling)
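
The selection rule above can be sketched as a small dispatch function. The names GEMV_THRESHOLD and select_kernel are hypothetical, and the sketch assumes that the batch size is counted as the number of input rows, i.e. the product of all input dimensions except the last (feature) dimension.

```python
GEMV_THRESHOLD = 8  # threshold stated in the text


def select_kernel(input_shape):
    """Return "gemv" or "gemm" for an activation tensor of the given shape.

    Assumption: the effective batch size is the product of every dimension
    except the last one, which holds the input features.
    """
    if len(input_shape) == 0:
        raise ValueError("input must have at least a feature dimension")
    rows = 1
    for dim in input_shape[:-1]:
        rows *= dim
    return "gemv" if rows < GEMV_THRESHOLD else "gemm"


decode_choice = select_kernel((1, 1, 4096))     # one new token per step
prefill_choice = select_kernel((1, 512, 4096))  # a 512-token prompt
```

Under this assumption, single-token decoding takes the GEMV path, while prompt prefilling, which processes many tokens at once, takes the GEMM path.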

Related Pages

Implementation:Mit_han_lab_Llm_awq_WQLinear_from_linear

Knowledge Sources

Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
Paper|AWQ|https://arxiv.org/abs/2306.00978

Domains

Quantization
Inference
