Principle: MIT Han Lab LLM AWQ Pseudo Quantization
Overview
Technique that simulates the numerical effects of low-bit quantization on FP16 weights without actually packing them, enabling accuracy evaluation.
Description
Pseudo quantization applies the quantize-then-dequantize operation to weight tensors: weights are rounded to n-bit grid values and then scaled back to FP16. This introduces the same quantization noise as real INT4 deployment but keeps weights in FP16 format, allowing standard PyTorch inference without custom CUDA kernels. Used for evaluating quantization quality (perplexity, benchmarks) before committing to real quantization.
Usage
When evaluating AWQ quantization quality without deploying with custom kernels (--q_backend fake mode).
Theoretical Basis
w_fake = dequant(quant(w)) = (clamp(round(w/s) + z, 0, 2^n - 1) - z) * s
where:
- s = (max - min) / (2^n - 1)
- z = round(-min / s)
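The formula above can be sketched in PyTorch as a group-wise fake-quantization function. This is a minimal illustration, not AWQ's exact implementation; the function name and the group size of 128 are illustrative choices:

```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize-then-dequantize: inject n-bit quantization noise while
    keeping the tensor in its original floating-point format."""
    orig_shape = w.shape
    # Group-wise quantization: each run of `group_size` weights shares one
    # scale s and zero-point z (asymmetric, per the formulas above).
    w = w.reshape(-1, group_size)
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** n_bits - 1
    # s = (max - min) / (2^n - 1); clamp avoids division by zero in flat groups
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    # z = round(-min / s), kept inside the representable integer range
    zeros = (-torch.round(min_val / scales)).clamp(0, max_int)
    # w_fake = (clamp(round(w/s) + z, 0, 2^n - 1) - z) * s
    w_fake = (torch.clamp(torch.round(w / scales) + zeros, 0, max_int) - zeros) * scales
    return w_fake.reshape(orig_shape)
```

Because the result stays in floating point, it can be dropped into any existing model for perplexity or benchmark evaluation; within each group, weights collapse onto at most 2^n distinct grid values.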
Related Pages
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- Quantization
- Evaluation