Principle:Mit han lab Llm awq W8A8 Vision Encoder Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Vision |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Principle of quantizing vision transformer encoder layers to 8-bit weights and 8-bit activations (W8A8) for efficient multimodal inference.
Description
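W8A8 vision encoder quantization replaces the floating-point linear layers in vision transformer blocks with INT8 quantized versions. Weights are quantized once, ahead of time, while activations use per-token dynamic scaling: one scale is computed per token at runtime, so a token with large outlier values does not reduce the precision available to the others. The approach also fuses RMSNorm with the quantization step, so normalization and quantization run in a single kernel instead of two, which reduces kernel launch overhead. This enables running large vision encoders (InternViT-6B, SigLIP) within the memory and compute budgets of edge deployment.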
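The fusion of RMSNorm with per-token quantization can be illustrated with a minimal pure-Python sketch. It is not the kernel used by the library: the function name `rmsnorm_quantize`, the `eps` default, and the symmetric range of [-127, 127] are assumptions made for illustration. The point is that a single pass over a token produces both the INT8 values and the scale needed to dequantize them later.

```python
import math

def rmsnorm_quantize(x, gamma, eps=1e-6, qmax=127):
    """Hypothetical helper: RMSNorm one token, then quantize it to INT8.

    Returns (q, scale) where q is a list of ints in [-qmax, qmax] and
    q[i] * scale approximates the normalized value.
    """
    # RMSNorm: divide by the root mean square, then apply the learned gain.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * g for v, g in zip(x, gamma)]

    # Per-token dynamic scale: s_x = max(|x|) / 127.
    amax = max(abs(v) for v in normed)
    scale = amax / qmax if amax > 0 else 1.0  # avoid division by zero

    # Round to the nearest integer and clamp to the INT8 range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in normed]
    return q, scale

q, scale = rmsnorm_quantize([3.0, -4.0], gamma=[1.0, 1.0])
```

In a real GPU kernel the intermediate `normed` values would stay in registers or shared memory rather than being written out, which is where the saving comes from.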
Usage
Apply this principle when deploying multimodal models where the vision encoder is a significant memory and compute bottleneck.
Theoretical Basis
For a linear layer y = Wx + b:
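- Weights W are statically quantized to INT8 ahead of time: W_q = round(W / s_w), where s_w is the weight scale (for example max(|W|) / 127, taken per tensor or per output channel)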
- Activations x are dynamically quantized per-token: x_q = round(x / s_x), where s_x = max(|x|) / 127
- The matrix product is computed on INT8 operands, accumulated in INT32, and then dequantized by rescaling: y = (W_q @ x_q) * s_w * s_x + b
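The three steps above can be sketched end to end in pure Python. This is an illustration under stated assumptions, not the library's implementation: the names `quantize_symmetric`, `quantize_weights`, and `w8a8_linear` are hypothetical, and the weight scale is taken per output channel, which the text above does not specify.

```python
def quantize_symmetric(values, qmax=127):
    """Symmetric quantization of one vector: returns (ints, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def quantize_weights(weight):
    """Static step, done once offline. Assumption: one scale per output row."""
    pairs = [quantize_symmetric(row) for row in weight]
    return [q for q, _ in pairs], [s for _, s in pairs]

def w8a8_linear(x_tokens, w_q, s_w, bias):
    """y = (W_q @ x_q) * s_w * s_x + b, with x quantized per token at runtime."""
    out = []
    for x in x_tokens:
        x_q, s_x = quantize_symmetric(x)  # dynamic per-token scale
        y = []
        for row_q, s, b in zip(w_q, s_w, bias):
            # Integer dot product; hardware would accumulate this in INT32.
            acc = sum(wi * xi for wi, xi in zip(row_q, x_q))
            y.append(acc * s * s_x + b)  # dequantize, then add the float bias
        out.append(y)
    return out

weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
bias = [0.1, -0.2]
tokens = [[1.0, 2.0, -3.0], [0.01, 0.02, 0.005]]  # very different magnitudes

w_q, s_w = quantize_weights(weight)
y_quant = w8a8_linear(tokens, w_q, s_w, bias)
y_float = [[sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)] for x in tokens]
```

The second token is about a hundred times smaller than the first. Because each token gets its own s_x, both use the full INT8 range, and `y_quant` stays close to `y_float` for both rows.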