Principle:Mit han lab Llm awq W8A8 Quantized Linear

Knowledge Sources	Mit_han_lab_Llm_awq QServe
Domains	Quantization, Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

W8A8 quantized linear layers multiply INT8 weights by INT8 activations, compute the input scales dynamically at runtime, and return FP16 output for efficient inference.

Description

W8A8 quantized linear layers store weights in INT8 and quantize activations to INT8 at runtime. The dynamic input scaling variant computes one scale per token (per input row), so tokens with very different activation magnitudes each use the full INT8 range. The integer GEMM result is then dequantized with the weight and activation scales and returned in FP16, the precision that downstream layers expect. A factory method (from_linear) converts a pre-trained FP16 linear layer, and a separate from_qkv method fuses the Q, K, and V projections into a single quantized layer.
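
The fusion idea behind from_qkv can be illustrated with a minimal pure-Python sketch. It shows only why stacking the Q, K, and V weights along the output dimension gives the same result as three separate projections; it does not quantize anything and is not the repository's implementation. The helper names linear and fuse_qkv, the [out][in] weight layout, and the toy values are all assumptions made for illustration.

```python
def linear(x, weight, bias):
    """Plain linear layer: x is [tokens][in], weight is [out][in], bias is [out]."""
    return [
        [sum(a * w for a, w in zip(row, w_row)) + b_o
         for w_row, b_o in zip(weight, bias)]
        for row in x
    ]


def fuse_qkv(q, k, v):
    """Stack three (weight, bias) pairs along the output dimension (hypothetical helper)."""
    weight = q[0] + k[0] + v[0]  # list concatenation appends output rows
    bias = q[1] + k[1] + v[1]
    return weight, bias


# Toy projections, each with 2 output features and 2 input features.
q_proj = ([[1.0, 0.0], [0.5, -0.5]], [0.0, 0.1])
k_proj = ([[0.0, 2.0], [1.0, 1.0]], [0.2, 0.0])
v_proj = ([[-1.0, 1.0], [0.25, 0.75]], [0.0, -0.3])

x = [[1.0, 2.0], [-3.0, 0.5]]  # two tokens

w_qkv, b_qkv = fuse_qkv(q_proj, k_proj, v_proj)
fused = linear(x, w_qkv, b_qkv)  # one matrix multiply instead of three

# Split the fused output back into its Q, K, V slices.
q_out = [row[0:2] for row in fused]
k_out = [row[2:4] for row in fused]
v_out = [row[4:6] for row in fused]
```

Because each output feature depends only on its own weight row, the fused layer reproduces the separate projections exactly while issuing a single GEMM.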

Usage

Apply this principle when replacing FP16 linear layers with INT8 versions to roughly halve weight memory and to run the matrix multiply on INT8 tensor cores.
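
As a rough illustration of the memory saving, the following sketch compares weight storage for a single layer. The 4096 x 4096 layer size and the choice of one FP16 scale per output channel are assumptions picked for the arithmetic, not values taken from the source.

```python
def weight_bytes(out_features, in_features, bytes_per_weight, scale_bytes=0):
    """Bytes for the weight matrix plus an optional per-output-channel scale."""
    return out_features * in_features * bytes_per_weight + out_features * scale_bytes


# Hypothetical 4096 x 4096 projection.
fp16_bytes = weight_bytes(4096, 4096, 2)                 # 2 bytes per FP16 weight
int8_bytes = weight_bytes(4096, 4096, 1, scale_bytes=2)  # 1 byte per weight + FP16 scales
ratio = int8_bytes / fp16_bytes
```

Under these assumptions the INT8 layer needs 16,785,408 bytes against 33,554,432 for FP16, so the scale overhead is negligible and the footprint is very close to half.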

Theoretical Basis

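Given an FP16 weight W of shape [out, in], an input x of shape [tokens, in], and a bias of shape [out]:

Static weight quantization: W_q = round(W / s_w), with s_w pre-computed offline
Dynamic activation quantization: x_q = round(x / s_x), with s_x = max(|x_row|) / 127 computed per token at runtime
INT8 GEMM: y_int = x_q @ W_q^T, accumulated in integer arithmetic
Dequantize: y = y_int * (s_x * s_w) + bias, returned in FP16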
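
The four steps above can be traced end to end with a small pure-Python sketch. It is meant only to make the arithmetic concrete, not to mirror the actual kernel: the helper names quantize_rows and w8a8_linear are hypothetical, the weight is assumed to be laid out as [out][in] with one scale per output channel, and the toy values are invented.

```python
def quantize_rows(mat, qmax=127):
    """Symmetric per-row quantization: returns (integer rows, per-row scales)."""
    q_rows, scales = [], []
    for row in mat:
        amax = max(abs(v) for v in row)
        scale = amax / qmax if amax > 0 else 1.0  # guard against all-zero rows
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w_q, s_w, bias):
    """x: [tokens][in] floats, w_q: [out][in] ints, s_w: [out], bias: [out]."""
    x_q, s_x = quantize_rows(x)  # dynamic: one scale per token, computed at runtime
    y = []
    for t, x_row in enumerate(x_q):
        out_row = []
        for o, w_row in enumerate(w_q):
            acc = sum(a * b for a, b in zip(x_row, w_row))   # integer accumulation
            out_row.append(acc * s_x[t] * s_w[o] + bias[o])  # dequantize, add bias
        y.append(out_row)
    return y


# Toy layer: 3 input features, 2 output features (assumed values).
W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
bias = [0.1, -0.2]
# Two tokens with very different magnitudes, to exercise per-token scaling.
x = [[1.0, 2.0, -3.0], [0.01, -0.02, 0.005]]

W_q, s_w = quantize_rows(W)  # static: done once, offline
y_quant = w8a8_linear(x, W_q, s_w, bias)

# Full-precision reference for comparison.
y_ref = [
    [sum(a * w for a, w in zip(x_row, w_row)) + b_o for w_row, b_o in zip(W, bias)]
    for x_row in x
]
```

Comparing y_quant with y_ref shows a small quantization error for both tokens. The second token, whose values are about a hundred times smaller, still maps onto the full INT8 range because its scale is computed from its own row.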

Related Pages

Implementation:Mit_han_lab_Llm_awq_W8A8OF16LinearDynamicInputScale
