

Principle: GPTQ Quantization (LLMBook-zh.github.io)

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Model_Compression, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

A post-training quantization technique that uses calibration data and second-order information to minimize quantization error layer by layer.

Description

GPTQ (Generative Pre-trained Transformer Quantization) is an advanced post-training quantization method that quantizes model weights to 4-bit or lower precision while maintaining high model quality. Unlike simple round-to-nearest quantization, GPTQ uses a calibration dataset to estimate the Hessian of the layer-wise reconstruction error (second-order information) and adjusts the remaining weights to compensate for the quantization error introduced by already-quantized weights.
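For contrast, naive round-to-nearest quantization treats every weight independently and ignores the layer's inputs entirely. A minimal pure-Python sketch of that baseline, with an arbitrary scale and a symmetric 4-bit grid (both illustrative choices, not part of any particular implementation):

```python
def round_to_nearest_4bit(w, scale):
    """Map a single weight to the nearest point on a symmetric
    4-bit grid: integer levels -8..7, spaced `scale` apart."""
    level = max(-8, min(7, round(w / scale)))
    return level * scale
```

With this scheme, the error from each rounded weight accumulates uncorrected across the layer; that uncompensated error is precisely what GPTQ's Hessian-based adjustment step addresses.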

Usage

Use GPTQ when you need aggressive quantization (4-bit or lower) with minimal quality degradation. It requires a calibration dataset but produces higher-quality quantized models than simple quantization methods.
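In practice GPTQ is typically applied through a library rather than implemented by hand. A sketch using the Hugging Face transformers integration (GPTQConfig, backed by the AutoGPTQ/optimum stack); the model name and calibration dataset below are illustrative choices, and the load call downloads and quantizes the model, so treat this as a configuration sketch rather than a drop-in script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits=4 and the "c4" calibration set are common choices, not requirements
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs layer by layer while the model loads
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
```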

Theoretical Basis

GPTQ operates layer by layer:

  1. For each layer, compute the Hessian matrix using calibration data.
  2. Quantize weights column by column.
  3. After quantizing each column, adjust remaining unquantized columns to compensate for the quantization error using the Hessian.

This is based on the Optimal Brain Quantization (OBQ) framework, extended for efficiency: GPTQ quantizes all rows in the same column order and batches the compensation updates, which makes quantizing billion-parameter models tractable.
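The three steps above can be sketched directly in NumPy. This is a simplified illustration, not the production algorithm: real implementations use a Cholesky factorization of the inverse Hessian for numerical stability and batch the updates, and the scale and dampening values here are arbitrary.

```python
import numpy as np

def quantize_grid(w, scale):
    """Round-to-nearest onto a symmetric 4-bit grid (levels -8..7)."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_quantize_layer(W, X, scale=0.05, damp=0.01):
    """Quantize W (out_features x in_features) column by column,
    pushing each column's quantization error onto the columns that
    are not yet quantized, weighted by the inverse Hessian.
    X holds calibration inputs, one column per sample."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    H = X @ X.T                                   # layer-wise Hessian proxy (2 X X^T up to a constant)
    H += damp * np.mean(np.diag(H)) * np.eye(d)   # dampening keeps H well-conditioned
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(d):
        Q[:, j] = quantize_grid(W[:, j], scale)           # step 2: quantize one column
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]            # scaled quantization error
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])    # step 3: compensate remaining columns
    return Q
```

Comparing the layer outputs `Q @ X` against `W @ X` on held-out inputs typically shows lower reconstruction error than quantizing every column with plain round-to-nearest.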
