Principle: GPTQ Quantization (LLMBook-zh, llmbook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Compression, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A post-training quantization technique that uses calibration data and second-order information to minimize quantization error layer by layer.
Description
GPTQ (Generative Pre-trained Transformer Quantization) is an advanced post-training quantization method that quantizes model weights to 4-bit or lower precision while maintaining high model quality. Unlike simple round-to-nearest quantization, which rounds each weight independently, GPTQ runs a calibration dataset through the model to estimate a per-layer Hessian of the reconstruction error (for a linear layer this depends only on the layer's inputs, not its outputs) and then adjusts the remaining weights to compensate for the error introduced by already-quantized weights.
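For contrast, the round-to-nearest baseline that GPTQ improves on can be sketched in a few lines of NumPy (a minimal illustration, not from the source; absmax scaling is one common choice of quantization grid):

```python
import numpy as np

def rtn_quantize(w, bits=4):
    # Round-to-nearest baseline: each weight is quantized independently,
    # with no compensation for the error it leaves behind.
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax      # absmax scaling
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale
```

Because every weight is rounded in isolation, the per-weight errors accumulate in the layer output; GPTQ's contribution is to make later weights absorb the errors of earlier ones.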
Usage
Use GPTQ when you need aggressive quantization (4-bit or lower) with minimal quality degradation. It requires a small calibration dataset but produces higher-quality quantized models than calibration-free methods such as round-to-nearest.
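The role of the calibration set can be made concrete: for a linear layer under a squared reconstruction loss, the Hessian with respect to the weights depends only on the layer's calibration inputs. A minimal sketch (the function name is ours; adding a small fraction of the mean diagonal as damping follows common GPTQ practice):

```python
import numpy as np

def layer_hessian(X, damp=0.01):
    # X: calibration inputs to the layer, shape (n_samples, d_in).
    # For squared reconstruction error, H = (2/n) * X^T X; a small
    # multiple of the mean diagonal keeps H well-conditioned and invertible.
    n, d = X.shape
    H = 2.0 * (X.T @ X) / n
    H += damp * np.mean(np.diag(H)) * np.eye(d)
    return H
```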
Theoretical Basis
GPTQ operates layer by layer:
- For each layer, compute the Hessian matrix using calibration data.
- Quantize weights column by column.
- After quantizing each column, adjust remaining unquantized columns to compensate for the quantization error using the Hessian.
This is based on the Optimal Brain Quantization (OBQ) framework, modified for efficiency: GPTQ quantizes all rows in the same fixed column order and uses lazy batched updates with a Cholesky factorization of the inverse Hessian, making quantization of billion-parameter models tractable.
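The per-layer loop above can be sketched for a single weight row (a simplified illustration of the compensation step; real implementations batch rows, reuse a Cholesky factor of the inverse Hessian, and group columns for the quantization scale):

```python
import numpy as np

def gptq_row(w, H, scale, damp=0.01):
    # Quantize one weight row column by column; after each column,
    # shift the yet-unquantized columns to absorb its error.
    d = w.size
    H = H + damp * np.mean(np.diag(H)) * np.eye(d)  # damping for stability
    Hinv = np.linalg.inv(H)
    w = w.astype(np.float64).copy()
    q = np.zeros(d)
    for i in range(d):
        q[i] = np.clip(np.round(w[i] / scale), -8, 7) * scale  # 4-bit grid
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i:] -= err * Hinv[i, i:]  # OBQ compensation of remaining columns
    return q
```

With a Hessian estimated from real calibration inputs, the off-diagonal entries of the inverse Hessian let correlated columns absorb each other's quantization error; with a diagonal Hessian the compensation vanishes and the method reduces to plain round-to-nearest.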