Principle:Mit han lab Llm awq Quantized Linear Module
Overview
A custom neural network module that replaces standard linear layers, storing weights in INT4 and dequantizing them inside CUDA kernels during inference.
Description
The quantized linear module stores weights in a packed INT4 format (four 4-bit weights per INT16 element), together with FP16 scales and scaled_zeros for each quantization group. During the forward pass it selects a CUDA kernel based on batch size: GEMV for batches smaller than 8 (e.g., autoregressive decoding) and GEMM for larger batches (e.g., prompt prefilling). The interleaved packing layout and split-k configuration are tuned for GPU memory access patterns. A from_linear() classmethod converts a standard nn.Linear layer into the quantized module.
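The packing ratio described above can be illustrated with a minimal sketch. It packs four unsigned 4-bit values into one 16-bit word in plain sequential order. This is an assumption made for readability: the module's actual interleaved layout is not specified here and will differ. The helper names pack_int4 and unpack_int4 are hypothetical.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) into 16-bit words, four per word.

    Sequential nibble order is an illustrative assumption; the real
    module uses an interleaved layout that is not reproduced here.
    """
    if len(values) % 4 != 0:
        raise ValueError("number of values must be a multiple of 4")
    words = []
    for i in range(0, len(values), 4):
        word = 0
        for j, v in enumerate(values[i:i + 4]):
            if not 0 <= v <= 15:
                raise ValueError("value outside the unsigned INT4 range")
            word |= v << (4 * j)  # nibble j occupies bits 4j .. 4j+3
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover four 4-bit values from each word."""
    values = []
    for word in words:
        for j in range(4):
            values.append((word >> (4 * j)) & 0xF)
    return values


weights = [1, 15, 0, 7, 8, 3, 12, 5]
packed = pack_int4(weights)      # two 16-bit words hold eight weights
restored = unpack_int4(packed)
```

Eight weights occupy two 16-bit words, a 4x reduction relative to FP16 storage before accounting for the per-group scales and zeros.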
Usage
Instances are created automatically by real_quantize_model_weight() or load_awq_llama_fast(), and are then used for all quantized inference.
Theoretical Basis
W4A16 compute (INT4 weights, FP16 activations):
y = dequant(W_int4) @ x_fp16
Dequantization:
w_fp16 = (w_int4 - zero) * scale
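The two formulas above can be combined into a small reference sketch in pure Python. It dequantizes each weight row group by group and then applies the matrix-vector product. The function names are hypothetical. The second variant assumes that a stored scaled zero equals -zero * scale, which turns the subtraction into a single multiply-add. That relationship is an assumption, not something this page confirms.

```python
def dequantize_row(q_row, scales, zeros, group_size):
    """w = (q - zero) * scale, with one (scale, zero) pair per group."""
    return [(q - zeros[i // group_size]) * scales[i // group_size]
            for i, q in enumerate(q_row)]


def dequantize_row_scaled_zeros(q_row, scales, scaled_zeros, group_size):
    """Same result, assuming scaled_zero = -zero * scale (an assumption)."""
    return [q * scales[i // group_size] + scaled_zeros[i // group_size]
            for i, q in enumerate(q_row)]


def w4a16_matvec(q_weight, scales, zeros, group_size, x):
    """y = dequant(W_int4) @ x for a single activation vector x."""
    y = []
    for q_row, s_row, z_row in zip(q_weight, scales, zeros):
        w = dequantize_row(q_row, s_row, z_row, group_size)
        y.append(sum(wi * xi for wi, xi in zip(w, x)))
    return y


# Toy data: 2 output features, 4 input features, groups of 2 weights.
q_weight = [[0, 15, 8, 4], [3, 3, 10, 12]]
scales = [[0.5, 0.25], [1.0, 0.5]]
zeros = [[8, 4], [3, 8]]
x = [1.0, 2.0, -1.0, 0.5]

y = w4a16_matvec(q_weight, scales, zeros, group_size=2, x=x)

# Precompute scaled zeros and check that both forms agree on row 0.
scaled_zeros_row0 = [-z * s for z, s in zip(zeros[0], scales[0])]
row0_a = dequantize_row(q_weight[0], scales[0], zeros[0], 2)
row0_b = dequantize_row_scaled_zeros(q_weight[0], scales[0], scaled_zeros_row0, 2)
```

In a real kernel the dequantization happens in registers while the matrix product runs, so the FP16 weight matrix is never materialized in memory. This sketch separates the two steps only for clarity.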
GEMV vs GEMM selection is based on a batch size threshold of 8:
- GEMV is used when batch size < 8 (e.g., single-token autoregressive decoding)
- GEMM is used when batch size >= 8 (e.g., prompt prefilling)
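The selection rule above can be sketched as a small dispatch function. This sketch assumes that "batch size" means the number of input rows, i.e. the product of all input dimensions except the last (feature) dimension. The names GEMV_THRESHOLD, num_rows, and select_kernel are hypothetical, and the returned strings stand in for the actual CUDA kernel calls.

```python
GEMV_THRESHOLD = 8  # threshold stated above


def num_rows(shape):
    """Number of input rows: product of all dimensions except the last."""
    rows = 1
    for dim in shape[:-1]:
        rows *= dim
    return rows


def select_kernel(shape, threshold=GEMV_THRESHOLD):
    """Return "gemv" for fewer than `threshold` rows, otherwise "gemm"."""
    return "gemv" if num_rows(shape) < threshold else "gemm"


decode_kernel = select_kernel((1, 1, 4096))     # one new token per step
prefill_kernel = select_kernel((1, 512, 4096))  # 512-token prompt
```

Under this assumption, single-token decoding takes the GEMV path, and a 512-token prompt takes the GEMM path.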
Related Pages
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- Quantization
- Inference