
Principle:Mit han lab Llm awq Pseudo Quantization

From Leeroopedia

Overview

A technique that simulates the numerical effects of low-bit quantization on FP16 weights without actually packing them into a low-bit format, enabling accuracy evaluation.

Description

Pseudo quantization applies the quantize-then-dequantize operation to weight tensors: weights are rounded to n-bit grid values and then scaled back to FP16. This introduces the same quantization noise as real INT4 deployment but keeps weights in FP16 format, allowing standard PyTorch inference without custom CUDA kernels. Used for evaluating quantization quality (perplexity, benchmarks) before committing to real quantization.
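The quantize-then-dequantize operation described above can be sketched as follows. This is a minimal per-tensor illustration in NumPy (the real AWQ implementation operates per-group in PyTorch); function and variable names are illustrative, not taken from the library.

```python
import numpy as np

def pseudo_quantize(w, n_bits=4):
    """Quantize-then-dequantize: inject n-bit rounding noise, keep float format."""
    q_max = 2 ** n_bits - 1                      # e.g. 15 for 4-bit
    w_min, w_max = w.min(), w.max()
    s = max((w_max - w_min) / q_max, 1e-5)       # scale; floor guards a zero range
    z = np.round(-w_min / s)                     # zero point
    q = np.clip(np.round(w / s) + z, 0, q_max)   # round onto the n-bit integer grid
    return (q - z) * s                           # scale back to floating point

w = np.random.randn(8, 8).astype(np.float32)
w_fake = pseudo_quantize(w)
# w_fake stays in float format but holds at most 2**4 distinct values,
# so it carries the same rounding noise as a real INT4 deployment.
assert np.unique(w_fake).size <= 16
```

Because the output remains an ordinary float tensor, it can be loaded back into a standard model and evaluated with unmodified inference code.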

Usage

Use when evaluating AWQ quantization quality (perplexity, benchmarks) without deploying custom INT4 kernels (the --q_backend fake mode).
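A hedged sketch of such an evaluation run, assuming the mit-han-lab/llm-awq entry point and its common flags; only `--q_backend fake` is named on this page, so the model path and other flag names here are illustrative and should be checked against the repository:

```shell
# Pseudo-quantize weights to 4 bits and evaluate, without packing or custom kernels.
# --q_backend fake keeps weights in FP16 after quantize-then-dequantize.
python -m awq.entry \
    --model_path /path/to/model \
    --w_bit 4 \
    --q_backend fake
```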

Theoretical Basis

w_fake = dequant(quant(w)) = (clamp(round(w / s) + z, 0, 2^n − 1) − z) · s

where:

  • n = number of quantization bits
  • s = (max − min) / (2^n − 1), the scale derived from the weight range [min, max]
  • z = round(−min / s), the zero point
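A toy numerical walk-through of these definitions (the weight range and sample value below are illustrative, not from the page):

```python
# Asymmetric 4-bit example over the weight range [min, max] = [-1.0, 2.0]
n = 4
w_min, w_max = -1.0, 2.0
s = (w_max - w_min) / (2 ** n - 1)   # scale: 3.0 / 15 = 0.2
z = round(-w_min / s)                # zero point: round(5.0) = 5

# Quantize-then-dequantize a single weight
w = 0.62
q = min(max(round(w / s) + z, 0), 2 ** n - 1)   # round(3.1) + 5 = 8
w_fake = (q - z) * s                            # (8 - 5) * 0.2 ≈ 0.6
```

The reconstructed value 0.6 differs from the original 0.62 by less than half a quantization step (s / 2 = 0.1); that gap is exactly the noise pseudo quantization injects.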

Domains

  • Quantization
  • Evaluation
