
Implementation:LLMBook-zh/LLMBook-zh.github.io GPTQConfig Quantization

From Leeroopedia


Knowledge Sources
  • Domains: Deep_Learning, Model_Compression, Inference
  • Last Updated: 2026-02-08 00:00 GMT

Overview

Concrete tool for GPTQ 4-bit quantization with calibration, built on HuggingFace Transformers and the auto-gptq backend.

Description

GPTQConfig defines the GPTQ quantization parameters (bit width, calibration dataset, tokenizer), and passing it as quantization_config to AutoModelForCausalLM.from_pretrained applies GPTQ quantization at load time. The calibration dataset (e.g., "c4") is used to compute the Hessian information that guides optimal weight quantization.

This is a Wrapper Doc documenting how the LLMBook repository uses HuggingFace's GPTQ integration.
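
For context (from the GPTQ paper rather than this repository): GPTQ quantizes each layer's weight matrix W by minimizing the reconstruction error on calibration inputs X, which is where the Hessian enters:

\hat{W} = \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2, \qquad H = 2 X X^\top

The calibration dataset supplies X, so its distribution directly shapes how the weights are quantized.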

Usage

Use this for aggressive 4-bit quantization when you have a calibration dataset available.
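
Per the Transformers documentation, GPTQConfig's dataset argument also accepts a raw list of strings, so you can calibrate on your own corpus. A minimal sketch; the model ID and calibration texts below are placeholder assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Placeholder model ID and calibration corpus -- substitute your own.
name = "facebook/opt-125m"
calibration_texts = [
    "Calibration text should resemble the model's deployment inputs.",
    "A few hundred representative samples are typical for GPTQ.",
]

tokenizer = AutoTokenizer.from_pretrained(name)
# Pass raw strings instead of a named dataset such as "c4".
quantization_config = GPTQConfig(bits=4, dataset=calibration_texts, tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    quantization_config=quantization_config,
)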

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/9.4 GPTQ实践.py ("GPTQ practice")
  • Lines: 1-10

Signature

# Configure GPTQ
tokenizer = AutoTokenizer.from_pretrained(name: str)
quantization_config = GPTQConfig(
    bits: int = 4,
    dataset: str = "c4",
    tokenizer: AutoTokenizer = tokenizer,
)

# Load with GPTQ quantization
model = AutoModelForCausalLM.from_pretrained(
    name: str,
    device_map: str = "auto",
    quantization_config: GPTQConfig = quantization_config,
)

Import

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

External Reference

I/O Contract

Inputs

Name      | Type          | Required | Description
----------|---------------|----------|-------------------------------------------
bits      | int           | Yes      | Quantization bit width (e.g., 4)
dataset   | str           | Yes      | Calibration dataset name (e.g., "c4")
tokenizer | AutoTokenizer | Yes      | Tokenizer for calibration data processing
name      | str           | Yes      | Model ID to quantize

Outputs

Name   | Type            | Description
-------|-----------------|------------------------------------
return | PreTrainedModel | GPTQ-quantized model loaded on GPU
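
As a rough check on the output's footprint: 4-bit weights take half a byte per parameter, so a 13B-parameter model's weights come to about 6.5 GB before overhead. A back-of-the-envelope sketch (the parameter count is approximate, and GPTQ typically leaves some modules such as embeddings in higher precision):

params = 13e9  # ~13B parameters (approximate)
bits = 4
print(f"~{params * bits / 8 / 1e9:.1f} GB of weights")  # ~6.5 GB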

Usage Examples

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

name = "yulan-team/YuLan-Chat-2-13b-fp16"

# Set up GPTQ: the tokenizer and calibration dataset drive quantization
# (requires the auto-gptq and optimum packages)
tokenizer = AutoTokenizer.from_pretrained(name)
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Load and quantize in one step; calibration on "c4" runs during this call
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    quantization_config=quantization_config
)
print(f"GPTQ 4-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
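
The returned model behaves like any other PreTrainedModel. A quick generation sanity check (the prompt is an arbitrary placeholder):

# Run a short generation on the quantized model.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Per the Transformers GPTQ documentation, the quantized weights can also be saved with model.save_pretrained(...) and reloaded later without re-running calibration.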

Related Pages

Implements Principle

Requires Environment
