Implementation: LLMBook-zh GPTQConfig Quantization
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Compression, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete recipe for 4-bit GPTQ quantization with calibration, using HuggingFace Transformers with the auto-gptq backend.
Description
GPTQConfig defines GPTQ quantization parameters (bit width, calibration dataset, tokenizer), and AutoModelForCausalLM.from_pretrained with quantization_config applies GPTQ quantization at load time. The calibration dataset (e.g., "c4") is used to compute Hessian information for optimal weight quantization.
This is a Wrapper Doc documenting how the LLMBook repository uses HuggingFace's GPTQ integration.
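Beyond the named calibration sets from the GPTQ paper (e.g., "c4", "wikitext2"), GPTQConfig also accepts a list of raw calibration strings. The sketch below is illustrative only: the model name is a small placeholder rather than the model used in the repository, and a real calibration set should contain far more samples.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

name = "facebook/opt-125m"  # placeholder model, not the one used in the repository
tokenizer = AutoTokenizer.from_pretrained(name)

# Custom calibration texts; in practice use enough samples to estimate the
# per-layer Hessian statistics reliably (a handful is far too few).
calibration_texts = [
    "GPTQ quantizes weights layer by layer using second-order information.",
    "Calibration text should resemble the inputs the model will see at inference.",
]

quantization_config = GPTQConfig(bits=4, dataset=calibration_texts, tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    quantization_config=quantization_config,
)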
Usage
Use this for aggressive 4-bit quantization when you have a calibration dataset available.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/9.4 GPTQ实践.py
- Lines: 1-10
Signature
# Configure GPTQ
tokenizer = AutoTokenizer.from_pretrained(name: str)
quantization_config = GPTQConfig(
    bits: int = 4,
    dataset: str = "c4",
    tokenizer: AutoTokenizer = tokenizer,
)
# Load with GPTQ quantization
model = AutoModelForCausalLM.from_pretrained(
    name: str,
    device_map: str = "auto",
    quantization_config: GPTQConfig = quantization_config,
)
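The repository snippet sets only the three parameters above. GPTQConfig exposes further optional knobs; the sketch below shows two common ones with typical values, which are assumptions for illustration rather than settings taken from the repository file.
# Optional knobs (illustrative values, not from the repository)
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,  # tokenizer created as in the signature above
    group_size=128,       # quantize weights in column groups of 128 (library default)
    desc_act=False,       # activation-order quantization; True may improve accuracy at some speed cost
)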
Import
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
External Reference
- HuggingFace Transformers quantization API reference (GPTQConfig): https://huggingface.co/docs/transformers/main_classes/quantization
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| bits | int | Yes | Quantization bit width (e.g., 4) |
| dataset | str | Yes | Calibration dataset name (e.g., "c4") |
| tokenizer | AutoTokenizer | Yes | Tokenizer for calibration data processing |
| name | str | Yes | Model ID to quantize |
Outputs
| Name | Type | Description |
|---|---|---|
| return | PreTrainedModel | GPTQ-quantized model loaded on GPU |
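A quick sanity check on the returned model, assuming model is the return value described above: get_memory_footprint() is a standard PreTrainedModel helper, and the GPTQ settings are carried on the loaded config.
# Inspect the quantized model returned by from_pretrained
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
print(model.config.quantization_config)  # GPTQ settings attached at load time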
Usage Examples
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
name = "yulan-team/YuLan-Chat-2-13b-fp16"
# Setup GPTQ
tokenizer = AutoTokenizer.from_pretrained(name)
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    quantization_config=quantization_config,
)
print(f"GPTQ 4-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")