Principle:Mit han lab Llm awq Quantized Linear Module

From Leeroopedia
Revision as of 17:18, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Mit_han_lab_Llm_awq_Quantized_Linear_Module.md)

Overview

A custom neural network module that replaces standard linear layers, storing weights in INT4 and dequantizing them inside CUDA kernels during inference.

Description

The quantized linear module stores weights in a packed INT4 format (four 4-bit weights per INT16 value), together with FP16 scales and scaled_zeros stored per quantization group. During the forward pass it selects a CUDA kernel dynamically: GEMV for small batch sizes (fewer than 8, e.g. autoregressive decoding) and GEMM for larger batches (e.g. prompt prefilling). The interleaved packing layout and split-k configuration are optimized for GPU memory access patterns. The module also provides a from_linear() classmethod that converts a standard nn.Linear layer.
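The packing arithmetic (four 4-bit values per 16-bit word) can be illustrated with a minimal sketch. It assumes a plain sequential nibble order, lowest nibble first, which is not the interleaved layout the CUDA kernels expect. The helper names pack_int4 and unpack_int4 are hypothetical and are not part of llm-awq.

```python
from typing import List

WEIGHTS_PER_WORD = 4  # four 4-bit values fill one 16-bit word


def pack_int4(values: List[int]) -> List[int]:
    """Pack unsigned 4-bit values (0..15) into 16-bit words.

    Assumption: sequential order, lowest nibble first. The real kernels
    use an interleaved order that is not reproduced here.
    """
    if len(values) % WEIGHTS_PER_WORD != 0:
        raise ValueError("number of values must be a multiple of 4")
    words = []
    for start in range(0, len(values), WEIGHTS_PER_WORD):
        word = 0
        for slot, value in enumerate(values[start:start + WEIGHTS_PER_WORD]):
            if not 0 <= value <= 15:
                raise ValueError("each value must fit in 4 bits")
            word |= value << (4 * slot)
        words.append(word)  # kept as an unsigned 16-bit pattern
    return words


def unpack_int4(words: List[int]) -> List[int]:
    """Inverse of pack_int4: recover the 4-bit values in their original order."""
    values = []
    for word in words:
        for slot in range(WEIGHTS_PER_WORD):
            values.append((word >> (4 * slot)) & 0xF)
    return values


weights = [3, 15, 0, 7, 1, 2, 4, 8]
packed = pack_int4(weights)          # 8 weights become 2 words
restored = unpack_int4(packed)
```

Storing eight weights in two 16-bit words rather than eight FP16 values is where the 4x reduction in weight memory comes from.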

Usage

The module is created automatically by real_quantize_model_weight() or load_awq_llama_fast(), and it is used for all quantized inference.

Theoretical Basis

W4A16 compute:

y = dequant(W_int4) @ x_fp16

Dequantization, applied per group with that group's zero point and scale:

w_fp16 = (w_int4 - zero) * scale
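A small worked example may make the two formulas above concrete. The sketch below dequantizes one weight row group by group and then takes the dot product with an activation vector. It uses Python floats in place of FP16. It also assumes that a stored scaled_zeros entry equals -(zero * scale), so that the same result can be computed as w_int4 * scale + scaled_zeros. That sign convention and all function names are assumptions for illustration only.

```python
from typing import List


def dequantize_row(q: List[int], scales: List[float],
                   zeros: List[int], group_size: int) -> List[float]:
    """w = (w_int4 - zero) * scale, using the zero and scale of each weight's group."""
    return [(w - zeros[i // group_size]) * scales[i // group_size]
            for i, w in enumerate(q)]


def dequantize_row_scaled_zeros(q: List[int], scales: List[float],
                                scaled_zeros: List[float],
                                group_size: int) -> List[float]:
    """Same result with a pre-multiplied zero term (assumed: -(zero * scale))."""
    return [w * scales[i // group_size] + scaled_zeros[i // group_size]
            for i, w in enumerate(q)]


def dot(a: List[float], b: List[float]) -> float:
    """One output element of y = dequant(W) @ x."""
    return sum(u * v for u, v in zip(a, b))


group_size = 4
q_row = [0, 8, 15, 10, 4, 0, 7, 12]   # INT4 codes, two groups of four
scales = [0.5, 0.25]                  # one scale per group
zeros = [8, 4]                        # one zero point per group
scaled_zeros = [-(z * s) for z, s in zip(zeros, scales)]

w_row = dequantize_row(q_row, scales, zeros, group_size)
w_row_alt = dequantize_row_scaled_zeros(q_row, scales, scaled_zeros, group_size)

x = [1.0] * 8                         # activation vector
y = dot(w_row, x)
```

Folding the zero point into a pre-multiplied term replaces a subtract-then-multiply with a single multiply-add per weight, which is one plausible reason to store scaled_zeros rather than raw zero points.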

GEMV vs GEMM selection is based on a batch size threshold of 8:

  • GEMV is used when batch size < 8 (e.g., single-token autoregressive decoding)
  • GEMM is used when batch size >= 8 (e.g., prompt prefilling)
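The selection rule can be sketched as a small dispatch function. This version assumes the "batch size" compared against the threshold is the number of input rows, that is, the product of every dimension except the last (feature) dimension. The function names and the returned labels are hypothetical stand-ins for the actual CUDA kernel calls.

```python
from typing import Sequence

GEMV_ROW_THRESHOLD = 8  # threshold stated above


def count_rows(shape: Sequence[int]) -> int:
    """Number of input rows: product of all dimensions except the last."""
    rows = 1
    for dim in shape[:-1]:
        rows *= dim
    return rows


def select_kernel(shape: Sequence[int]) -> str:
    """Return 'gemv' for fewer than 8 rows, otherwise 'gemm'."""
    return "gemv" if count_rows(shape) < GEMV_ROW_THRESHOLD else "gemm"


decode_choice = select_kernel((1, 1, 4096))     # one new token per step
prefill_choice = select_kernel((1, 512, 4096))  # a 512-token prompt
```

Under this assumption, a decoding step with one new token per sequence stays on the GEMV path until eight or more sequences are batched together, while any prompt of eight or more tokens takes the GEMM path.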

Related Pages

Knowledge Sources

Domains

  • Quantization
  • Inference
