Environment:Huggingface Optimum GPTQ Quantization Environment
| Knowledge Sources | |
|---|---|
| Domains | Quantization, GPU_Acceleration |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
GPU-accelerated (CUDA/XPU) or Intel IPEX CPU environment with `gptqmodel` >= 1.6.0 and `accelerate` for GPTQ weight quantization and loading.
Description
This environment provides the dependencies needed for GPTQ quantization of large language models. It requires the `gptqmodel` package (successor to the deprecated `auto-gptq`) for the actual quantization algorithm, and `accelerate` for device dispatch and checkpoint loading. Hardware-wise, either an NVIDIA GPU (CUDA), Intel GPU (XPU), or Intel CPU with IPEX support is required. The `datasets` package is also needed for loading calibration data (wikitext2, c4, c4-new).
Usage
Use this environment when performing GPTQ model quantization (the `GPTQQuantizer.quantize_model()` workflow) or loading pre-quantized GPTQ models (`load_quantized_model()`). This is the mandatory prerequisite for all GPTQ-related Implementations.
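The quantization workflow referenced above can be sketched as follows. This is a minimal sketch assuming optimum's public `GPTQQuantizer` API; the bit width and dataset arguments are illustrative, and the heavy imports are deferred so the helper can be defined without the dependencies installed.

```python
def quantize_to_gptq(model_id: str, bits: int = 4, dataset: str = "wikitext2"):
    """Quantize a causal LM with GPTQ and return the quantized model.

    Illustrative sketch: imports are deferred so this function can be
    defined anywhere, but calling it requires torch, transformers,
    optimum, gptqmodel, and accelerate (see Required Packages below),
    plus a CUDA/XPU device or an Intel IPEX-enabled CPU.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.gptq import GPTQQuantizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    # dataset must be one of the supported calibration sets: wikitext2, c4, c4-new
    quantizer = GPTQQuantizer(bits=bits, dataset=dataset)
    return quantizer.quantize_model(model, tokenizer)
```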
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux recommended | CUDA/ROCm drivers required for GPU |
| Hardware | NVIDIA GPU (CUDA) OR Intel GPU (XPU) OR Intel CPU with IPEX | A GPU is recommended for quantization speed; CPU-only operation is supported only via gptqmodel + IPEX |
| Disk | Sufficient for model + calibration data | Models can be 2-70GB+ depending on size |
Dependencies
Required Packages
- `gptqmodel` >= 1.6.0 (mandatory for quantization and loading)
- `accelerate` (mandatory for model loading and device dispatch)
- `torch` >= 2.1.0 (core dependency)
- `transformers` >= 4.36.0 (AutoTokenizer, model configs)
- `datasets` (required for calibration data loading: wikitext2, c4, c4-new)
- `tqdm` (progress bars during quantization)
Deprecated Packages
- `auto-gptq` >= 0.4.99 (deprecated, being replaced by `gptqmodel`)
Credentials
No credentials required for GPTQ quantization itself. Model access may require:
- `HF_TOKEN`: HuggingFace API token if quantizing gated models (e.g., Llama).
Quick Install
# Install gptqmodel (replaces deprecated auto-gptq)
pip install gptqmodel
# Install all required packages (quote version specifiers so the shell
# does not interpret ">" as output redirection)
pip install optimum gptqmodel accelerate datasets "torch>=2.1.0" "transformers>=4.36.0"
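After installing, a quick audit can confirm which required packages are actually present. This is an illustrative helper, not part of optimum; the minimum versions in the dictionary mirror the Required Packages list above.

```python
from importlib import metadata

# Documented minimum versions (None = any version acceptable)
REQUIRED = {
    "gptqmodel": "1.6.0",
    "accelerate": None,
    "torch": "2.1.0",
    "transformers": "4.36.0",
    "datasets": None,
    "tqdm": None,
}

def audit(required=REQUIRED):
    """Return {package: installed_version_or_None} for a quick sanity check."""
    found = {}
    for pkg in required:
        try:
            found[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            found[pkg] = None  # not installed
    return found
```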
Code Evidence
GPTQModel requirement from `optimum/gptq/quantizer.py:379-382`:
if not is_gptqmodel_available():
    raise RuntimeError(
        "gptqmodel is required in order to perform gptq quantization: "
        "`pip install gptqmodel`. Please notice that auto-gptq will be "
        "deprecated in the future."
    )
GPU/CPU hardware requirement from `optimum/gptq/quantizer.py:384-389`:
gptq_supports_cpu = is_gptqmodel_available()
if not gptq_supports_cpu and not torch.cuda.is_available():
    raise RuntimeError(
        "No cuda gpu or cpu support using Intel/IPEX found. "
        "A gpu or cpu with Intel/IPEX is required for quantization."
    )
Hardware detection function from `optimum/gptq/quantizer.py:60-61`:
def has_device_more_than_cpu():
return torch.cuda.is_available() or (hasattr(torch, "xpu") and torch.xpu.is_available())
Accelerate requirement for loading from `optimum/gptq/quantizer.py:809-813`:
if not is_accelerate_available():
    raise RuntimeError(
        "You need to install accelerate in order to load and dispatch weights to"
        "a quantized model. You can do it with `pip install accelerate`"
    )
GPTQModel version validation from `optimum/utils/import_utils.py:222-230`:
def is_gptqmodel_available():
    if _gptqmodel_available:
        v = version.parse(importlib.metadata.version("gptqmodel"))
        if v >= GPTQMODEL_MINIMUM_VERSION:
            return True
        else:
            raise ImportError(
                f"Found an incompatible version of gptqmodel. Found version {v}, "
                f"but only version >= {GPTQMODEL_MINIMUM_VERSION} are supported"
            )
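The same minimum-version gate can be reproduced with a pure-stdlib comparison. This is an illustrative sketch only: the real check shown above uses `packaging.version.parse`, which also handles pre-release and local version segments (e.g. `2.1.0+cu118`) that this naive parser does not.

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare plain dotted version strings numerically.

    Illustrative only; does not handle pre-release or local segments.
    """
    def parse(v: str) -> tuple:
        # Take at most major.minor.patch and compare as integer tuples
        return tuple(int(part) for part in v.split(".")[:3])

    return parse(installed) >= parse(minimum)
```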
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `gptqmodel is required in order to perform gptq quantization` | gptqmodel not installed | `pip install gptqmodel` |
| `No cuda gpu or cpu support using Intel/IPEX found` | No GPU and no Intel IPEX CPU support | Install CUDA drivers or Intel IPEX |
| `Asymmetric sym=False quantization is not supported with auto-gptq` | Using deprecated auto-gptq with asymmetric quant | Switch to gptqmodel: `pip install gptqmodel` |
| `gptq_v2 format only supported with gptqmodel` | GPTQ v2 format requires gptqmodel | `pip install gptqmodel` |
| `disk offload is not supported with GPTQ quantization` | Model has disk offload in device map | Remove disk offload; use GPU/CPU only |
| `You need to install accelerate` | accelerate not installed for model loading | `pip install accelerate` |
| `Found an incompatible version of gptqmodel` | gptqmodel version below 1.6.0 | `pip install -U "gptqmodel>=1.6.0"` |
Compatibility Notes
- CUDA GPUs: Full support for NVIDIA GPUs via CUDA. Cache is explicitly cleared with `torch.cuda.empty_cache()` during quantization.
- Intel XPU (Arc GPUs): Supported via `torch.xpu.is_available()` check. Cache cleared with `torch.xpu.empty_cache()`.
- CPU-only: Only supported when using `gptqmodel` (not auto-gptq) with Intel IPEX.
- auto-gptq deprecation: `auto-gptq` is deprecated. The codebase enforces `gptqmodel` for new features (asymmetric quantization, GPTQ v2 format).
- Device maps: Disk offload is explicitly blocked. CPU offload with multiple devices triggers a warning about potential memory issues.
- PTB datasets: The `ptb` and `ptb-new` calibration datasets are deprecated and raise `RuntimeError` if used.
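The calibration-dataset constraints described above can be expressed as a small guard. The function name and exact error messages are illustrative; optimum raises comparable errors internally when a deprecated or unknown dataset name is passed.

```python
# Supported and deprecated calibration datasets, per the notes above
SUPPORTED_DATASETS = {"wikitext2", "c4", "c4-new"}
DEPRECATED_DATASETS = {"ptb", "ptb-new"}

def validate_calibration_dataset(name: str) -> str:
    """Reject deprecated or unknown calibration dataset names.

    Illustrative sketch, not the actual optimum implementation.
    """
    if name in DEPRECATED_DATASETS:
        raise RuntimeError(f"{name} is deprecated as a calibration dataset")
    if name not in SUPPORTED_DATASETS:
        raise ValueError(f"unknown calibration dataset: {name}")
    return name
```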
Related Pages
- Implementation:Huggingface_Optimum_GPTQQuantizer_Init
- Implementation:Huggingface_Optimum_GPTQQuantizer_Convert_Model
- Implementation:Huggingface_Optimum_Store_Input_Hook
- Implementation:Huggingface_Optimum_GPTQ_Fasterquant
- Implementation:Huggingface_Optimum_GPTQQuantizer_Pack_Model
- Implementation:Huggingface_Optimum_GPTQQuantizer_Post_Init
- Implementation:Huggingface_Optimum_Get_Dataset