Principle:Mit han lab Llm awq Pseudo Quantization

From Leeroopedia
Revision as of 17:26, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Mit_han_lab_Llm_awq_Pseudo_Quantization.md)

Overview

A technique that simulates the numerical effects of low-bit quantization on FP16 weights without actually packing them into a low-bit format, so that accuracy can be evaluated with ordinary FP16 inference.

Description

Pseudo quantization applies a quantize-then-dequantize operation to weight tensors: each weight is rounded to the nearest value on an n-bit grid and then scaled back to FP16. This introduces the same rounding error as real low-bit (for example INT4) deployment but keeps the weights in FP16 format, so standard PyTorch inference works without custom CUDA kernels. It is used to evaluate quantization quality (perplexity, benchmarks) before committing to real quantization.
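
The round trip described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names `quantize`, `dequantize`, and `pseudo_quantize` are hypothetical, the min/max affine scheme is the one given under Theoretical Basis, one scale is shared by the whole list, and the `1e-5` guard for a constant input is an added assumption. Python floats stand in for FP16.

```python
def quantize(weights, n_bits=4):
    """Map floats to unsigned n-bit integer codes (what a real deployment would pack)."""
    q_max = 2 ** n_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Assumption: guard against a zero range when all weights are equal.
    scale = max(w_max - w_min, 1e-5) / q_max
    zero = round(-w_min / scale)
    codes = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return codes, scale, zero


def dequantize(codes, scale, zero):
    """Map integer codes back to floats that lie on the quantization grid."""
    return [(q - zero) * scale for q in codes]


def pseudo_quantize(weights, n_bits=4):
    """Quantize and immediately dequantize; the result stays in floating point."""
    codes, scale, zero = quantize(weights, n_bits)
    return dequantize(codes, scale, zero)


# Illustrative values only, not real model weights.
weights = [0.013 * i - 0.4 for i in range(64)]
fake = pseudo_quantize(weights, n_bits=4)
distinct_values = len(set(fake))  # cannot exceed 2**4 = 16
```

The output has the same length and type as the input, yet it takes at most 2^n distinct values. That is why evaluating with these weights exposes the model to the same rounding error it would see after real packing.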

Usage

Use when evaluating AWQ quantization quality without deploying custom kernels (the --q_backend fake mode).

Theoretical Basis

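w_fake = dequant(quant(w)) = (clamp(round(w / s) + z, q_min, q_max) - z) * s

where:

  • s = (w_max - w_min) / (2^n - 1) is the scale, with w_min and w_max the smallest and largest weight values being quantized
  • z = round(-w_min / s) is the integer zero point
  • q_min = 0 and q_max = 2^n - 1 are the bounds of the n-bit integer grid; the clamp applies to these integer codes, not to the weight range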
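
As a worked example, the formula can be evaluated step by step for n = 2. The sketch below is illustrative only: the helper name `pseudo_quantize_steps` is hypothetical, the weight range is written as w_min and w_max and the integer range as q_min = 0 and q_max = 2^n - 1 (names chosen here for clarity), and the input is assumed to contain at least two different values so that s is non-zero.

```python
def pseudo_quantize_steps(w, n):
    """Return each intermediate quantity of the formula for a list of floats."""
    w_min, w_max = min(w), max(w)
    q_min, q_max = 0, 2 ** n - 1
    s = (w_max - w_min) / q_max          # scale; assumes w_max > w_min
    z = round(-w_min / s)                # integer zero point
    # quant(w): round onto the integer grid, shift by z, clamp to [q_min, q_max]
    q = [min(max(round(x / s) + z, q_min), q_max) for x in w]
    # dequant(q): undo the shift and rescale to floating point
    w_fake = [(c - z) * s for c in q]
    return s, z, q, w_fake


s, z, q, w_fake = pseudo_quantize_steps([-1.0, -0.2, 0.4, 2.0], n=2)
print(s, z, q, w_fake)
# prints: 1.0 1 [0, 1, 1, 3] [-1.0, 0.0, 0.0, 2.0]
```

Here s = (2.0 - (-1.0)) / 3 = 1.0 and z = round(1.0 / 1.0) = 1. The weights -0.2 and 0.4 both map to integer code 1 and come back as 0.0, which is the quantization error that pseudo quantization is meant to expose.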

Related Pages

Knowledge Sources

Domains

  • Quantization
  • Evaluation
