Principle: mit-han-lab/llm-awq INT4 Weight Packing

From Leeroopedia

Overview

The process of converting FP16 model weights to a packed INT4 representation with group-wise quantization parameters for efficient inference.

Description

Real quantization converts FP16 weights to INT4 format using asymmetric quantization with zero-point. Weights are quantized per group (typically 128 elements): for each group, scales and zero-points are computed from the min/max range. The quantized integers are then packed into INT16 values (4 weights per INT16) with an interleaved layout optimized for GPU GEMM/GEMV kernels. The nn.Linear layers are replaced with WQLinear modules that store packed qweight, scales, and scaled_zeros buffers.
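The steps above can be sketched in NumPy. This is an illustrative sketch, not the actual llm-awq implementation: the function name and return layout are assumptions, and the packing below uses a simple sequential order rather than the kernel-specific interleaved order.

```python
import numpy as np

def quantize_and_pack(weight, group_size=128, n_bits=4):
    """Group-wise asymmetric quantization of an FP16/FP32 weight matrix,
    packing four 4-bit values into each 16-bit word (hypothetical layout).

    weight: array of shape (out_features, in_features); in_features must be
    divisible by group_size.
    """
    out_f, in_f = weight.shape
    w = weight.astype(np.float32).reshape(out_f, in_f // group_size, group_size)

    # Per-group min/max determine scale and zero-point (asymmetric scheme).
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    qmax = 2 ** n_bits - 1                               # 15 for INT4
    scales = np.maximum(w_max - w_min, 1e-5) / qmax
    zeros = np.round(-w_min / scales)                    # zero-point per group

    # Quantize to unsigned 4-bit integers in [0, 15].
    q = np.clip(np.round(w / scales + zeros), 0, qmax).astype(np.uint16)
    q = q.reshape(out_f, in_f)

    # Pack 4 consecutive 4-bit values per 16-bit word. Real GPU kernels use
    # an interleaved order for coalesced GEMM/GEMV access; sequential order
    # is used here only for clarity.
    q = q.reshape(out_f, in_f // 4, 4)
    packed = q[..., 0] | (q[..., 1] << 4) | (q[..., 2] << 8) | (q[..., 3] << 12)
    return packed.view(np.int16), scales.squeeze(-1), zeros.squeeze(-1)
```

A WQLinear-style module would store the packed words as its qweight buffer alongside the per-group scales and (scale-multiplied) zero-points.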

Usage

Applied after the AWQ scaling and clipping transforms have been folded into the weights, to produce the final deployable quantized model.

Theoretical Basis

Asymmetric quantization:

q = round((w - min) / (max - min) * (2^n - 1))

Group-wise quantization with group_size=128. Interleaved 4-bit packing for efficient GPU kernel access patterns.
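The quantization formula above can be worked through on a toy group of weights. The values below are made up for illustration; dequantization inverts the mapping, so the reconstruction error per weight is at most half a quantization step.

```python
import numpy as np

# Toy group of weights; n = 4 bits gives the quantized range [0, 15].
w = np.array([-0.30, -0.10, 0.05, 0.42], dtype=np.float32)
n = 4
w_min, w_max = w.min(), w.max()

# q = round((w - min) / (max - min) * (2^n - 1))
q = np.round((w - w_min) / (w_max - w_min) * (2 ** n - 1)).astype(np.int32)

# Dequantize: w_hat = q / (2^n - 1) * (max - min) + min
w_hat = q / (2 ** n - 1) * (w_max - w_min) + w_min
print(q.tolist())  # quantized integers in [0, 15]
```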

Related Pages

Knowledge Sources

Domains

  • Quantization
  • Model_Compression
