
Environment:Huggingface Optimum GPTQ Quantization Environment

From Leeroopedia
Knowledge Sources
Domains Quantization, GPU_Acceleration
Last Updated 2026-02-15 00:00 GMT

Overview

GPU-accelerated (CUDA/XPU) or Intel IPEX CPU environment with `gptqmodel` >= 1.6.0 and `accelerate` for GPTQ weight quantization and loading.

Description

This environment provides the dependencies needed for GPTQ quantization of large language models. It requires the `gptqmodel` package (successor to the deprecated `auto-gptq`) for the quantization algorithm itself, and `accelerate` for device dispatch and checkpoint loading. On the hardware side, an NVIDIA GPU (CUDA), an Intel GPU (XPU), or an Intel CPU with IPEX support is required. The `datasets` package is also needed for loading calibration data (wikitext2, c4, c4-new).

Usage

Use this environment when performing GPTQ model quantization (the `GPTQQuantizer.quantize_model()` workflow) or loading pre-quantized GPTQ models (`load_quantized_model()`). This is the mandatory prerequisite for all GPTQ-related Implementations.
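A minimal sketch of both workflows, assuming `facebook/opt-125m` and the save directory as illustrative placeholders (any causal LM works). The heavy imports are kept inside the functions so the module loads without the full GPU stack:

```python
# Hedged sketch of the quantize and load workflows; model ID and save path
# are illustrative placeholders, not requirements of the environment.

def build_gptq_config(bits=4, dataset="wikitext2", group_size=128):
    """Settings passed through to GPTQQuantizer; wikitext2 is one of the
    supported calibration datasets (wikitext2, c4, c4-new)."""
    return {"bits": bits, "dataset": dataset, "group_size": group_size}

def quantize_and_save(model_id="facebook/opt-125m", save_dir="opt-125m-gptq"):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.gptq import GPTQQuantizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    quantizer = GPTQQuantizer(**build_gptq_config())
    quantized = quantizer.quantize_model(model, tokenizer)  # runs calibration
    quantizer.save(quantized, save_dir)

def load_quantized(model_id="facebook/opt-125m", save_dir="opt-125m-gptq"):
    import torch
    from accelerate import init_empty_weights
    from transformers import AutoModelForCausalLM
    from optimum.gptq import load_quantized_model

    # Build an empty-weight skeleton, then let accelerate dispatch the
    # quantized checkpoint onto available devices.
    with init_empty_weights():
        empty = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    empty.tie_weights()
    return load_quantized_model(empty, save_folder=save_dir, device_map="auto")
```

Both entry points mirror the `GPTQQuantizer.quantize_model()` and `load_quantized_model()` calls named above; exact constructor arguments beyond `bits`, `dataset`, and `group_size` are omitted for brevity.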

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux recommended | CUDA/ROCm drivers required for GPU |
| Hardware | NVIDIA GPU (CUDA), Intel GPU (XPU), or Intel CPU with IPEX | GPU needed for quantization unless using gptqmodel + IPEX on CPU |
| Disk | Sufficient for model + calibration data | Checkpoints range from ~2 GB to 70 GB+ depending on model size |
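A back-of-envelope estimate for the disk requirement, assuming the usual 2 bytes/parameter for fp16 weights and roughly 0.5 bytes/parameter for 4-bit GPTQ plus an assumed ~10% overhead for scales and zero points (the overhead figure is an assumption, not a measured value):

```python
# Rough disk-size estimate: fp16 takes 2 bytes/parameter; 4-bit GPTQ takes
# ~0.5 bytes/parameter plus scale/zero-point overhead (assumed ~10% here).

def weight_gib(n_params, bits, overhead=0.0):
    bytes_total = n_params * bits / 8 * (1 + overhead)
    return bytes_total / 2**30

fp16_7b = weight_gib(7e9, 16)        # ~13 GiB for a 7B model in fp16
gptq4_7b = weight_gib(7e9, 4, 0.10)  # ~3.6 GiB at 4-bit with overhead
```

Remember to budget for the original checkpoint, the quantized output, and the calibration dataset simultaneously during quantization.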

Dependencies

Required Packages

  • `gptqmodel` >= 1.6.0 (mandatory for quantization and loading)
  • `accelerate` (mandatory for model loading and device dispatch)
  • `torch` >= 2.1.0 (core dependency)
  • `transformers` >= 4.36.0 (AutoTokenizer, model configs)
  • `datasets` (required for calibration data loading: wikitext2, c4, c4-new)
  • `tqdm` (progress bars during quantization)

Deprecated Packages

  • `auto-gptq` >= 0.4.99 (deprecated, being replaced by `gptqmodel`)

Credentials

No credentials required for GPTQ quantization itself. Model access may require:

  • `HF_TOKEN`: HuggingFace API token if quantizing gated models (e.g., Llama).
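A hedged example of wiring `HF_TOKEN` through for gated models; the helper name is illustrative, and the token is read from the environment rather than hard-coded:

```python
# Read the HuggingFace token from the environment; returns None when the
# model being quantized is not gated and no token is needed.
import os

def hf_token():
    return os.environ.get("HF_TOKEN")

# Usage (transformers accepts a `token` argument on from_pretrained):
#   AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=hf_token())
```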

Quick Install

# Install gptqmodel (replaces deprecated auto-gptq)
pip install gptqmodel

# Install all required packages
pip install optimum gptqmodel accelerate datasets "torch>=2.1.0" "transformers>=4.36.0"
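After installing, a quick stdlib-only sanity check can confirm the required packages are importable without actually loading the heavy modules:

```python
# Report which required packages are missing, using only the stdlib.
# find_spec locates a module without importing it.
from importlib.util import find_spec

REQUIRED = ["gptqmodel", "accelerate", "torch", "transformers", "datasets", "tqdm"]

def missing_packages(names=REQUIRED):
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    print("All required packages found" if not missing else f"Missing: {missing}")
```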

Code Evidence

GPTQModel requirement from `optimum/gptq/quantizer.py:379-382`:

if not is_gptqmodel_available():
    raise RuntimeError(
        "gptqmodel is required in order to perform gptq quantization: "
        "`pip install gptqmodel`. Please notice that auto-gptq will be "
        "deprecated in the future."
    )

GPU/CPU hardware requirement from `optimum/gptq/quantizer.py:384-389`:

gptq_supports_cpu = is_gptqmodel_available()

if not gptq_supports_cpu and not torch.cuda.is_available():
    raise RuntimeError(
        "No cuda gpu or cpu support using Intel/IPEX found. "
        "A gpu or cpu with Intel/IPEX is required for quantization."
    )

Hardware detection function from `optimum/gptq/quantizer.py:60-61`:

def has_device_more_than_cpu():
    return torch.cuda.is_available() or (hasattr(torch, "xpu") and torch.xpu.is_available())

Accelerate requirement for loading from `optimum/gptq/quantizer.py:809-813`:

if not is_accelerate_available():
    raise RuntimeError(
        "You need to install accelerate in order to load and dispatch weights to"
        "a quantized model. You can do it with `pip install accelerate`"
    )

GPTQModel version validation from `optimum/utils/import_utils.py:222-230`:

def is_gptqmodel_available():
    if _gptqmodel_available:
        v = version.parse(importlib.metadata.version("gptqmodel"))
        if v >= GPTQMODEL_MINIMUM_VERSION:
            return True
        else:
            raise ImportError(
                f"Found an incompatible version of gptqmodel. Found version {v}, "
                f"but only version >= {GPTQMODEL_MINIMUM_VERSION} are supported"
            )
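The gate above can be mirrored with a simplified stdlib-only comparison, assuming plain `X.Y.Z` version strings (the real check uses `packaging`'s `version.parse`, which also handles pre-releases):

```python
# Simplified stand-in for the gptqmodel version gate: parses plain X.Y.Z
# strings into tuples and compares against the 1.6.0 minimum.

GPTQMODEL_MINIMUM_VERSION = (1, 6, 0)

def meets_minimum(installed: str, minimum=GPTQMODEL_MINIMUM_VERSION) -> bool:
    parts = tuple(int(x) for x in installed.split(".")[:3])
    return parts >= minimum
```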

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `gptqmodel is required in order to perform gptq quantization` | gptqmodel not installed | `pip install gptqmodel` |
| `No cuda gpu or cpu support using Intel/IPEX found` | No GPU and no Intel IPEX CPU support | Install CUDA drivers or Intel IPEX |
| `Asymmetric sym=False quantization is not supported with auto-gptq` | Using deprecated auto-gptq with asymmetric quant | Switch to gptqmodel: `pip install gptqmodel` |
| `gptq_v2 format only supported with gptqmodel` | GPTQ v2 format requires gptqmodel | `pip install gptqmodel` |
| `disk offload is not supported with GPTQ quantization` | Model has disk offload in device map | Remove disk offload; use GPU/CPU only |
| `You need to install accelerate` | accelerate not installed for model loading | `pip install accelerate` |
| `Found an incompatible version of gptqmodel` | gptqmodel version below 1.6.0 | `pip install -U "gptqmodel>=1.6.0"` |

Compatibility Notes

  • CUDA GPUs: Full support for NVIDIA GPUs via CUDA. Cache is explicitly cleared with `torch.cuda.empty_cache()` during quantization.
  • Intel XPU (Arc GPUs): Supported via `torch.xpu.is_available()` check. Cache cleared with `torch.xpu.empty_cache()`.
  • CPU-only: Only supported when using `gptqmodel` (not auto-gptq) with Intel IPEX.
  • auto-gptq deprecation: `auto-gptq` is deprecated. The codebase enforces `gptqmodel` for new features (asymmetric quantization, GPTQ v2 format).
  • Device maps: Disk offload is explicitly blocked. CPU offload with multiple devices triggers a warning about potential memory issues.
  • PTB datasets: The `ptb` and `ptb-new` calibration datasets are deprecated and raise `RuntimeError` if used.
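The device-map rules above can be sketched as a small validator. This mirrors the described behavior (disk offload rejected, CPU offload with multiple devices warned about), not the exact optimum code:

```python
# Validate a device map against the GPTQ constraints described above:
# disk offload is rejected outright; CPU offload mixed with other devices
# triggers a memory warning.
import warnings

def check_device_map(device_map: dict):
    devices = set(device_map.values())
    if "disk" in devices:
        raise ValueError("disk offload is not supported with GPTQ quantization")
    if "cpu" in devices and len(devices) > 1:
        warnings.warn("CPU offload with multiple devices may cause memory issues")
    return devices
```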
