Principle: mit-han-lab/llm-awq INT4 Weight Packing

From Leeroopedia

Overview

The process of converting FP16 model weights to a packed INT4 representation with group-wise quantization parameters for efficient inference.

Description

Real quantization converts FP16 weights to INT4 format using asymmetric quantization with zero-point. Weights are quantized per group (typically 128 elements): for each group, scales and zero-points are computed from the min/max range. The quantized integers are then packed into INT16 values (4 weights per INT16) with an interleaved layout optimized for GPU GEMM/GEMV kernels. The nn.Linear layers are replaced with WQLinear modules that store packed qweight, scales, and scaled_zeros buffers.
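The steps above can be sketched in NumPy. This is an illustrative sketch, not the actual llm-awq implementation: the function name and return layout are assumptions, and the packing below uses a simple sequential order rather than the kernel-specific interleaved order.

```python
import numpy as np

def quantize_and_pack(weight, group_size=128, n_bits=4):
    """Group-wise asymmetric quantization of an FP16/FP32 weight matrix,
    packing four 4-bit values into each 16-bit word (hypothetical layout).

    weight: array of shape (out_features, in_features); in_features must be
    divisible by group_size.
    """
    out_f, in_f = weight.shape
    w = weight.astype(np.float32).reshape(out_f, in_f // group_size, group_size)

    # Per-group min/max determine scale and zero-point (asymmetric scheme).
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    qmax = 2 ** n_bits - 1                               # 15 for INT4
    scales = np.maximum(w_max - w_min, 1e-5) / qmax
    zeros = np.round(-w_min / scales)                    # zero-point per group

    # Quantize to unsigned 4-bit integers in [0, 15].
    q = np.clip(np.round(w / scales + zeros), 0, qmax).astype(np.uint16)
    q = q.reshape(out_f, in_f)

    # Pack 4 consecutive 4-bit values per 16-bit word. Real GPU kernels use
    # an interleaved order for coalesced GEMM/GEMV access; sequential order
    # is used here only for clarity.
    q = q.reshape(out_f, in_f // 4, 4)
    packed = q[..., 0] | (q[..., 1] << 4) | (q[..., 2] << 8) | (q[..., 3] << 12)
    return packed.view(np.int16), scales.squeeze(-1), zeros.squeeze(-1)
```

A WQLinear-style module would store the packed words as its qweight buffer alongside the per-group scales and (scale-multiplied) zero-points.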

Usage

Applied after the AWQ scaling and clipping transforms have been folded into the weights, to produce the final deployable quantized model.

Theoretical Basis

Asymmetric quantization:

q = round((w - min) / (max - min) * (2^n - 1))

Group-wise quantization with group_size=128. Interleaved 4-bit packing for efficient GPU kernel access patterns.
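The quantization formula above can be worked through on a toy group of weights. The values below are made up for illustration; dequantization inverts the mapping, so the reconstruction error per weight is at most half a quantization step.

```python
import numpy as np

# Toy group of weights; n = 4 bits gives the quantized range [0, 15].
w = np.array([-0.30, -0.10, 0.05, 0.42], dtype=np.float32)
n = 4
w_min, w_max = w.min(), w.max()

# q = round((w - min) / (max - min) * (2^n - 1))
q = np.round((w - w_min) / (w_max - w_min) * (2 ** n - 1)).astype(np.int32)

# Dequantize: w_hat = q / (2^n - 1) * (max - min) + min
w_hat = q / (2 ** n - 1) * (w_max - w_min) + w_min
print(q.tolist())  # quantized integers in [0, 15]
```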

Related Pages

Knowledge Sources

Domains

  • Quantization
  • Model_Compression
