Principle:Mit han lab Llm awq W8A8 Quantized Linear

From Leeroopedia
Knowledge Sources
Domains Quantization, Model_Architecture
Last Updated 2026-02-15 00:00 GMT

Overview

The principle of implementing linear layers with INT8 weights and INT8 activations (W8A8), using dynamic input scaling and FP16 output, for efficient inference.

Description

W8A8 quantized linear layers store weights in INT8 format and quantize activations to INT8 at runtime. The dynamic input scaling variant computes one scale per token, so tokens with very different activation magnitudes each use the full INT8 range. The INT8 products are accumulated in a wider integer type (INT32 on INT8 tensor cores), and the result is dequantized to FP16 so that downstream layers receive FP16 activations. A factory method (from_linear) converts pre-trained FP16 linear layers, and a special from_qkv method fuses separate Q, K, V projections into a single quantized layer.
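The QKV fusion mentioned above can be illustrated with a minimal NumPy sketch. It assumes the fusion amounts to concatenating the three weight matrices along the output dimension, which the source does not spell out; the helper names fuse_qkv_weights and split_qkv are hypothetical and are not the library's API. Quantization is left out here so that the equivalence with the separate projections is exact.

```python
import numpy as np

def fuse_qkv_weights(w_q, w_k, w_v):
    """Stack Q, K, V weights, each shaped (out_i, in), along the output axis."""
    return np.concatenate([w_q, w_k, w_v], axis=0)

def split_qkv(y, sizes):
    """Split a fused output (tokens, sum(sizes)) back into Q, K, V parts."""
    q_end = sizes[0]
    k_end = sizes[0] + sizes[1]
    return y[:, :q_end], y[:, q_end:k_end], y[:, k_end:]

rng = np.random.default_rng(0)
hidden = 8  # illustrative size, not taken from the source
w_q, w_k, w_v = (rng.standard_normal((hidden, hidden)) for _ in range(3))
x = rng.standard_normal((2, hidden))  # two tokens

w_qkv = fuse_qkv_weights(w_q, w_k, w_v)  # shape (3 * hidden, hidden)
q, k, v = split_qkv(x @ w_qkv.T, (hidden, hidden, hidden))

# One fused GEMM should reproduce the three separate projections.
print(np.allclose(q, x @ w_q.T, atol=1e-6),
      np.allclose(k, x @ w_k.T, atol=1e-6),
      np.allclose(v, x @ w_v.T, atol=1e-6))
```

Under this assumption, the fused layer issues one larger GEMM instead of three smaller ones, and the input only needs to be quantized once.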

Usage

Apply this principle when replacing FP16 linear layers with INT8 versions to reduce memory footprint and leverage INT8 tensor cores.
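To make the memory claim concrete, the following back-of-the-envelope sketch compares weight storage for a single layer. The 4096 x 4096 layer size, the one FP32 scale per output channel, and the helper name linear_weight_bytes are illustrative assumptions, not values from the source.

```python
def linear_weight_bytes(out_features, in_features, bytes_per_weight, scale_bytes=0):
    """Bytes for the weight matrix plus one optional scale per output channel."""
    return out_features * in_features * bytes_per_weight + out_features * scale_bytes

MIB = 2 ** 20

# Hypothetical 4096 x 4096 projection.
fp16_bytes = linear_weight_bytes(4096, 4096, bytes_per_weight=2)
int8_bytes = linear_weight_bytes(4096, 4096, bytes_per_weight=1, scale_bytes=4)

print(f"FP16 weights: {fp16_bytes / MIB:.3f} MiB")
print(f"INT8 weights + scales: {int8_bytes / MIB:.3f} MiB")
```

Under these assumptions the weights shrink by roughly half, and the per-channel scales add only a negligible overhead.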

Theoretical Basis

Given FP16 weight W and input x:

  • Static weight quantization: W_q = round(W / s_w), s_w pre-computed
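  • Dynamic activation quantization: x_q = round(x / s_x), s_x = max(|x_row|) / 127, computed per token (row) at runtime
  • INT8 GEMM: y_int = x_q @ W_q^T, with W stored as (out_features, in_features) and products accumulated in INT32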
  • Dequantize: y = y_int * (s_w * s_x) + bias
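The four steps above can be sketched in NumPy as follows. This is a minimal reference under stated assumptions, not the library's kernel: weights are assumed to be shaped (out_features, in_features) with one symmetric scale per output channel (the source only says s_w is pre-computed), the INT8 GEMM is emulated with INT32 arithmetic, and the function names are hypothetical.

```python
import numpy as np

def quantize_weight(w):
    """Static symmetric INT8 quantization, one scale per output channel (assumed)."""
    w = w.astype(np.float32)
    s_w = np.abs(w).max(axis=1, keepdims=True) / 127.0   # shape (out, 1)
    s_w = np.maximum(s_w, 1e-8)                          # guard all-zero rows
    w_q = np.clip(np.round(w / s_w), -127, 127).astype(np.int8)
    return w_q, s_w

def quantize_activation(x):
    """Dynamic symmetric INT8 quantization, one scale per token (row)."""
    x = x.astype(np.float32)
    s_x = np.abs(x).max(axis=1, keepdims=True) / 127.0   # shape (tokens, 1)
    s_x = np.maximum(s_x, 1e-8)                          # guard all-zero rows
    x_q = np.clip(np.round(x / s_x), -127, 127).astype(np.int8)
    return x_q, s_x

def w8a8_linear(x, w_q, s_w, bias=None):
    """INT8 x INT8 linear layer with INT32 accumulation and FP16 output."""
    x_q, s_x = quantize_activation(x)
    # Emulated INT8 GEMM: widen to INT32 so the accumulation cannot overflow.
    y_int = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # (tokens, out)
    # Dequantize: (tokens, 1) * (1, out) broadcasts to one scale per output entry.
    y = y_int.astype(np.float32) * (s_x * s_w.T)
    if bias is not None:
        y = y + bias
    return y.astype(np.float16)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)    # 4 tokens
w = rng.standard_normal((16, 64)).astype(np.float32)   # 16 output features
b = rng.standard_normal(16).astype(np.float32)

w_q, s_w = quantize_weight(w)       # done once, offline
y = w8a8_linear(x, w_q, s_w, b)     # activations quantized on every call
ref = x @ w.T + b
print("max relative error:", np.abs(y.astype(np.float32) - ref).max() / np.abs(ref).max())
```

Because s_x is recomputed from each row of x on every call, a token with unusually large activations does not degrade the resolution available to the other tokens.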
