Principle: mit-han-lab llm-awq INT4 Weight Packing
Overview
The process of converting FP16 model weights to a packed INT4 representation with group-wise quantization parameters for efficient inference.
Description
Real quantization converts FP16 weights to INT4 format using asymmetric quantization with zero-point. Weights are quantized per group (typically 128 elements): for each group, scales and zero-points are computed from the min/max range. The quantized integers are then packed into INT16 values (4 weights per INT16) with an interleaved layout optimized for GPU GEMM/GEMV kernels. The nn.Linear layers are replaced with WQLinear modules that store packed qweight, scales, and scaled_zeros buffers.
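The group-wise quantize-then-pack flow described above can be sketched in NumPy. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the real GPU kernels pack nibbles in an interleaved order tuned for GEMM/GEMV access patterns, whereas this sketch packs them sequentially for readability.

```python
import numpy as np

def quantize_groups(w, n_bits=4, group_size=128):
    """Asymmetric group-wise quantization of a 1-D FP weight vector.

    Returns unsigned INT4 codes plus per-group scales and zero-points.
    """
    q_max = (1 << n_bits) - 1                      # 15 for INT4
    w = w.reshape(-1, group_size)                  # one row per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scales = (w_max - w_min) / q_max               # per-group scale
    zeros = np.round(-w_min / scales)              # per-group zero-point
    q = np.clip(np.round(w / scales) + zeros, 0, q_max)
    return q.astype(np.uint16), scales, zeros

def pack_int4(q):
    """Pack four 4-bit codes into one uint16 (sequential nibble order).

    NOTE: the deployed kernels use an interleaved layout instead; sequential
    packing is shown here only to illustrate the 4-weights-per-INT16 idea.
    """
    q = q.reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 4) | (q[:, 2] << 8) | (q[:, 3] << 12)).astype(np.uint16)
```

Dequantization recovers w ≈ (q - zeros) * scales per group, so the reconstruction error of each weight is bounded by half a quantization step.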
Usage
Applied after the AWQ scaling and clipping transforms, to create the final deployable quantized model.
Theoretical Basis
Asymmetric quantization (n-bit):
q = round((w - min) / (max - min) * (2^n - 1))
with scale = (max - min) / (2^n - 1) and zero_point = round(-min / scale), so that dequantization is w ≈ q * scale - zero_point * scale (the precomputed zero_point * scale is the scaled_zeros buffer).
Group-wise quantization with group_size=128. Interleaved 4-bit packing for efficient GPU kernel access patterns.
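A short NumPy sketch of why precomputing scale * zero_point is useful: the dequantization (q - zero_point) * scale rewrites as q * scale - scaled_zeros, a single multiply-subtract per weight. The arithmetic below is generic asymmetric quantization under the formula above, not the library's kernel code; variable names mirror the buffers named in the Description.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)    # one quantization group

q_max = 15                                         # 2^4 - 1 for INT4
scale = (w.max() - w.min()) / q_max
zero_point = np.round(-w.min() / scale)
q = np.clip(np.round(w / scale) + zero_point, 0, q_max)

# Fused dequantization using the precomputed scaled zero-point.
scaled_zeros = scale * zero_point
w_hat = q * scale - scaled_zeros                   # == (q - zero_point) * scale
max_err = np.abs(w - w_hat).max()                  # bounded by the step size
```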
Related Pages
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- Quantization
- Model_Compression