Principle:Huggingface Optimum Calibration Data Preparation
Overview
Process of preparing representative text samples to serve as calibration data for Hessian estimation during GPTQ quantization.
Description
GPTQ requires running calibration data through the model to estimate per-layer Hessian matrices. The calibration dataset must be representative of the model's typical input distribution to ensure that the quantization error is minimized for real-world usage patterns.
The calibration data preparation involves three stages:
- Dataset selection — Standard calibration sets include `wikitext2`, `c4`, and `c4-new`. Custom datasets can also be provided as a list of strings or pre-tokenized data.
- Tokenization — Text samples are tokenized to a fixed sequence length (`seqlen`). For standard datasets, random windows of `seqlen` tokens are extracted from the corpus.
- Batching and padding — Tokenized samples are batched together. When `batch_size > 1`, samples are padded to the longest sequence in each batch using the specified `pad_token_id`.
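The three stages above can be sketched in plain Python. This is an illustrative sketch, not the Optimum implementation: the helper names `sample_windows` and `pad_batch` are hypothetical, and a real pipeline would operate on tokenizer output rather than raw integer lists.

```python
import random

def sample_windows(token_ids, nsamples, seqlen, seed=0):
    """Extract nsamples random windows of seqlen tokens from a tokenized corpus."""
    rng = random.Random(seed)
    windows = []
    for _ in range(nsamples):
        start = rng.randint(0, len(token_ids) - seqlen)
        windows.append(token_ids[start:start + seqlen])
    return windows

def pad_batch(batch, pad_token_id):
    """Right-pad every sample in a batch to the longest sequence in that batch."""
    max_len = max(len(s) for s in batch)
    return [s + [pad_token_id] * (max_len - len(s)) for s in batch]

# Toy corpus of token ids standing in for a tokenized dataset
corpus = list(range(100))
samples = sample_windows(corpus, nsamples=4, seqlen=8)
padded = pad_batch([corpus[:5], corpus[:8]], pad_token_id=0)
```

With `batch_size=1` (the default in the usage example below) padding is a no-op, since each batch contains a single sample.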
The number of calibration samples (`nsamples`) and their sequence length (`seqlen`) directly affect quantization quality and calibration time.
Usage
Use before running GPTQ quantization to prepare the calibration dataset. The dataset is passed to `GPTQQuantizer` either as a string name (for standard datasets) or as a list of strings/tokenized data.
```python
from optimum.gptq.data import get_dataset, prepare_dataset

# Get standard calibration dataset
dataset = get_dataset("wikitext2", tokenizer, nsamples=128, seqlen=2048, split="train")

# Prepare for batched processing
dataset = prepare_dataset(dataset, batch_size=1, pad_token_id=tokenizer.pad_token_id)
```
Theoretical Basis
The Hessian matrix is computed as:
H = 2 * X^T * X
where X is the matrix of layer inputs across calibration samples. More diverse calibration data yields better Hessian estimates, leading to more accurate quantization. Typical configurations use 128 samples of 2048 tokens.
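The formula can be made concrete with a toy example. The following is a minimal sketch of H = 2 * X^T * X in pure Python, not how Optimum computes it in practice (which accumulates over per-layer activations during calibration forward passes):

```python
def hessian(X):
    """Compute H = 2 * X^T * X for a list-of-rows matrix X,
    where each row is one calibration input vector."""
    n_cols = len(X[0])
    H = [[0.0] * n_cols for _ in range(n_cols)]
    for row in X:
        for i in range(n_cols):
            for j in range(n_cols):
                # Each sample contributes 2 * x_i * x_j to entry (i, j)
                H[i][j] += 2.0 * row[i] * row[j]
    return H

# Two calibration inputs with two features each
X = [[1.0, 2.0],
     [3.0, 4.0]]
H = hessian(X)  # [[20.0, 28.0], [28.0, 40.0]]
```

Note that H is symmetric and grows with each added sample, which is why more (and more diverse) samples yield a better-conditioned estimate.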
The quality of the Hessian estimate depends on:
- Sample count — More samples provide a better approximation of the true input distribution.
- Sample diversity — Samples should cover the range of inputs the model will encounter in practice.
- Sequence length — Longer sequences capture more context-dependent activation patterns.
Random seeding ensures reproducibility of the calibration dataset across runs.
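Reproducibility via seeding can be demonstrated with a small sketch (the helper `sample_starts` is illustrative, not an Optimum function):

```python
import random

def sample_starts(corpus_len, nsamples, seqlen, seed):
    """Pick random window start offsets; a fixed seed gives identical picks per run."""
    rng = random.Random(seed)
    return [rng.randint(0, corpus_len - seqlen) for _ in range(nsamples)]

a = sample_starts(10_000, nsamples=8, seqlen=512, seed=42)
b = sample_starts(10_000, nsamples=8, seqlen=512, seed=42)
assert a == b  # same seed -> same calibration windows across runs
```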
Supported Datasets
| Dataset Name | Source | Description |
|---|---|---|
| `wikitext2` | `wikitext/wikitext-2-raw-v1` | Wikipedia articles; standard GPTQ benchmark. |
| `c4` | `allenai/c4` | Colossal Clean Crawled Corpus; diverse web text. |
| `c4-new` | `allenai/c4` | Same source as `c4`, alternate processing. |
Related
- implemented_by → Implementation:Huggingface_Optimum_Get_Dataset