Environment:Huggingface Diffusers Quantization Environment

Knowledge Sources	Huggingface Diffusers BitsAndBytes
Domains	Quantization, Optimization
Last Updated	2026-02-13 21:00 GMT

Overview

Quantization backend environment for Diffusers: supports BitsAndBytes (NF4/INT8), TorchAO, Optimum Quanto, GGUF, and NVIDIA ModelOpt. All backends except GGUF require a CUDA or XPU GPU.

Description

This environment provides the quantization backends that Diffusers supports for reducing model memory footprint. The library uses a unified DiffusersAutoQuantizer that dispatches to the appropriate backend based on the configuration class. BitsAndBytes is the most mature backend for 4-bit (NF4) and 8-bit (INT8) quantization. TorchAO provides PyTorch-native quantization; its extended dtype support requires PyTorch >= 2.5. GGUF enables loading pre-quantized models from the llama.cpp ecosystem. All quantization backends except GGUF require a CUDA or XPU GPU; CPU-only quantization is not supported.
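
The dispatch-by-configuration-class idea can be illustrated with a toy registry. This is only a sketch of the pattern: every name below (`ToyBnbConfig`, `ToyGGUFConfig`, `TOY_QUANTIZER_REGISTRY`, `toy_dispatch`) is hypothetical and is not part of Diffusers, whose actual lookup logic may differ.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for backend configuration classes (not Diffusers classes).
@dataclass
class ToyBnbConfig:
    load_in_4bit: bool = True


@dataclass
class ToyGGUFConfig:
    compute_dtype: str = "bfloat16"


# Hypothetical stand-ins for backend quantizers.
class ToyQuantizer:
    def __init__(self, config):
        self.config = config


class ToyBnbQuantizer(ToyQuantizer):
    pass


class ToyGGUFQuantizer(ToyQuantizer):
    pass


# One registry entry per configuration class.
TOY_QUANTIZER_REGISTRY = {
    ToyBnbConfig: ToyBnbQuantizer,
    ToyGGUFConfig: ToyGGUFQuantizer,
}


def toy_dispatch(config):
    """Return the quantizer registered for the type of `config`."""
    try:
        quantizer_cls = TOY_QUANTIZER_REGISTRY[type(config)]
    except KeyError:
        raise ValueError(f"No quantizer registered for {type(config).__name__}") from None
    return quantizer_cls(config)
```

With this layout, supporting another backend means adding one configuration class, one quantizer class, and one registry entry; the call site stays unchanged.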

Usage

Required for the Model Quantization workflow and for any pipeline that loads quantized models via the `quantization_config` parameter. Use it when GPU memory is limited and you need to fit large models (e.g., Flux at 12B parameters) onto consumer GPUs.
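
A minimal sketch of passing `quantization_config` when loading a model, assuming the BitsAndBytes backend and a Flux transformer. The model id `black-forest-labs/FLUX.1-dev` and the helper names `nf4_config_kwargs` and `load_nf4_flux_transformer` are assumptions made for illustration; substitute whichever checkpoint you actually use.

```python
def nf4_config_kwargs(compute_dtype):
    """Keyword arguments for a 4-bit NF4 BitsAndBytesConfig."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": compute_dtype,
    }


def load_nf4_flux_transformer(model_id="black-forest-labs/FLUX.1-dev"):
    """Load the transformer of a Flux checkpoint with NF4 weights.

    Imports are deferred so that defining this helper does not require
    the GPU stack; calling it does (GPU, bitsandbytes, accelerate).
    """
    import torch
    from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

    quant_config = BitsAndBytesConfig(**nf4_config_kwargs(torch.bfloat16))
    return FluxTransformer2DModel.from_pretrained(
        model_id,  # assumed checkpoint, see the note above
        subfolder="transformer",
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )
```

Calling `load_nf4_flux_transformer()` on a machine that meets the requirements on this page should return a quantized transformer that can then be passed to a pipeline.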

System Requirements

Category	Requirement	Notes
OS	Linux (recommended)	BitsAndBytes CUDA support is Linux-first
Hardware	NVIDIA GPU (CUDA) or Intel XPU	Required; CPU-only quantization raises RuntimeError
VRAM	8 GB+	4-bit quantization reduces weight memory by ~75% relative to 16-bit weights
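
The ~75% figure follows from simple arithmetic on bits per parameter. The sketch below estimates weight memory only, for a 12B-parameter model; it ignores quantization metadata (such as scaling factors), any layers kept in higher precision, and activation memory, so real usage will be somewhat higher. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given parameter width."""
    return num_params * bits_per_param / 8 / 2**30


params = 12e9  # e.g., a 12B-parameter model
for label, bits in [("16-bit", 16), ("8-bit (INT8)", 8), ("4-bit (NF4)", 4)]:
    print(f"{label:>13}: {weight_memory_gib(params, bits):6.2f} GiB")
```

For 12B parameters this gives roughly 22.4 GiB at 16-bit, 11.2 GiB at 8-bit, and 5.6 GiB at 4-bit, which is why 4-bit weights can fit on an 8 GB card while 16-bit weights cannot.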

Dependencies

Backend-Specific Packages

BitsAndBytes (NF4/INT8):

`bitsandbytes` >= 0.43.3
`accelerate` >= 0.26.0

TorchAO:

`torchao` >= 0.7.0
`torch` >= 2.5.0 (for extended dtype support)
`torch` >= 2.6.0 (for safe globals in serialization)

Optimum Quanto:

`optimum_quanto` >= 0.2.6

GGUF:

`gguf` >= 0.10.0

NVIDIA ModelOpt:

`nvidia_modelopt[hf]` >= 0.33.1
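
The minimum versions above can be checked before loading a model. This is a standard-library sketch, not the check Diffusers itself performs: the helper names are hypothetical, the distribution names are assumed to match what `pip` installed, and pre-release suffixes are ignored (so `0.43.3.dev0` is treated as `0.43.3`).

```python
from importlib import metadata

# Minimum versions as listed in the Dependencies section.
MINIMUM_VERSIONS = {
    "bitsandbytes": "0.43.3",
    "accelerate": "0.26.0",
    "torchao": "0.7.0",
    "optimum-quanto": "0.2.6",
    "gguf": "0.10.0",
    "nvidia-modelopt": "0.33.1",
}


def release_tuple(version: str) -> tuple:
    """Leading numeric components only, e.g. '2.6.0+cu124' -> (2, 6, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """Compare release tuples after zero-padding them to equal length."""
    a, b = release_tuple(installed), release_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_backends(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each package to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for package, required in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        if meets_minimum(installed, required):
            report[package] = "ok"
        else:
            report[package] = f"too old ({installed} < {required})"
    return report


for package, status in check_backends().items():
    print(f"{package}: {status}")
```

Only the backends you intend to use need to report `ok`; a `missing` entry for an unused backend is harmless.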

Credentials

No additional credentials required beyond the base environment.

Quick Install

# BitsAndBytes quantization (most common)
pip install "diffusers[bitsandbytes]" transformers accelerate

# TorchAO quantization
pip install "diffusers[torchao]" transformers accelerate

# GGUF support
pip install "diffusers[gguf]" transformers accelerate

# All quantization backends
pip install diffusers transformers accelerate bitsandbytes torchao optimum-quanto gguf "nvidia-modelopt[hf]"

Code Evidence

GPU requirement validation from `bnb_quantizer.py:63-73`:

def validate_environment(self, *args, **kwargs):
    if not (torch.cuda.is_available() or torch.xpu.is_available()):
        raise RuntimeError("No GPU found. A GPU is needed for quantization.")
    if not is_accelerate_available() or is_accelerate_version("<", "0.26.0"):
        raise ImportError(
            "Using `bitsandbytes` 4-bit quantization requires Accelerate: "
            "`pip install 'accelerate>=0.26.0'`"
        )
    if not is_bitsandbytes_available() or is_bitsandbytes_version("<", "0.43.3"):
        raise ImportError(
            "Using `bitsandbytes` 4-bit quantization requires the latest version "
            "of bitsandbytes: `pip install -U bitsandbytes`"
        )
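
The first check in the excerpt can be reproduced as a standalone preflight test, so a missing GPU is reported before any model download starts. This is a sketch under assumptions: `accelerator_available` is a hypothetical helper, and the `getattr` guard is there for PyTorch builds that do not expose `torch.xpu`.

```python
def accelerator_available() -> bool:
    """Return True if PyTorch reports a CUDA or XPU device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to report.
        return False
    if torch.cuda.is_available():
        return True
    # Older PyTorch builds may not have the torch.xpu module at all.
    xpu = getattr(torch, "xpu", None)
    return bool(xpu is not None and xpu.is_available())


if not accelerator_available():
    print("No CUDA or XPU device found; GPU-only quantization backends will fail.")
```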

TorchAO PyTorch version gates from `torchao_quantizer.py:50-65`:

# PyTorch >= 2.5 for extended dtypes
_TORCHAO_SUPPORT_EXTENDED_DTYPES = is_torch_version(">=", "2.5")
# PyTorch >= 2.6.0 for safe globals serialization
if is_torch_version(">=", "2.6.0"):
    torch.serialization.add_safe_globals([...])

GGUF CUDA kernel environment variable from `quantizers/gguf/utils.py:33`:

DIFFUSERS_GGUF_CUDA_KERNELS = os.getenv("DIFFUSERS_GGUF_CUDA_KERNELS", "false")
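
Because the excerpt reads the variable at module level, it is safest to set it before Diffusers is imported. The sketch below does that from Python; the helper `gguf_cuda_kernels_requested` and the set of accepted truthy spellings are assumptions for illustration, and the library's own parsing may differ.

```python
import os

# Set the flag before importing diffusers so the module-level read sees it.
os.environ["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true"


def gguf_cuda_kernels_requested() -> bool:
    """Interpret the flag locally (assumed truthy spellings, not the library's parser)."""
    value = os.getenv("DIFFUSERS_GGUF_CUDA_KERNELS", "false")
    return value.strip().lower() in {"1", "true", "yes", "on"}


print("GGUF CUDA kernels requested:", gguf_cuda_kernels_requested())
```

Setting the variable in the shell (`export DIFFUSERS_GGUF_CUDA_KERNELS=true`) before starting Python has the same effect.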

Common Errors

Error Message	Cause	Solution
`RuntimeError: No GPU found. A GPU is needed for quantization.`	No CUDA/XPU GPU available	Use a machine with an NVIDIA or Intel GPU
`ImportError: Using bitsandbytes 4-bit quantization requires Accelerate: pip install 'accelerate>=0.26.0'`	accelerate is missing or older than 0.26.0	`pip install -U accelerate`
`ImportError: Using bitsandbytes 4-bit quantization requires the latest version of bitsandbytes: pip install -U bitsandbytes`	bitsandbytes is missing or older than 0.43.3	`pip install -U bitsandbytes`
`Converting into 4-bit weights from flax weights is currently not supported`	Attempting to quantize Flax model	Convert to PyTorch format first

Compatibility Notes

BitsAndBytes: Linux-first; Windows support is experimental. Requires a CUDA or XPU GPU, per the validation check under Code Evidence.
TorchAO: PyTorch-native; broadest dtype support with PyTorch >= 2.5. Pre-quantized model loading requires PyTorch >= 2.5.0.
GGUF: Can load pre-quantized GGUF files. Optional CUDA kernel acceleration via `DIFFUSERS_GGUF_CUDA_KERNELS=true`.
Optimum Quanto: Framework-agnostic quantization from HuggingFace.
Pipeline-level quantization: Use `PipelineQuantizationConfig` to quantize different components with different backends.
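
A sketch of pipeline-level quantization with a different backend per component, assuming a Flux pipeline whose components are named `transformer` and `text_encoder_2`. The helper name, the choice of backends, and the default model id are assumptions; check the `PipelineQuantizationConfig` documentation for the arguments supported by your installed version.

```python
def load_mixed_quantized_pipeline(model_id="black-forest-labs/FLUX.1-dev"):
    """Load a pipeline with Quanto on the transformer and bitsandbytes on a text encoder.

    Imports are deferred so that defining this helper does not require
    the GPU stack; calling it does.
    """
    import torch
    from diffusers import DiffusionPipeline, QuantoConfig
    from diffusers.quantizers import PipelineQuantizationConfig
    from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

    pipeline_quant_config = PipelineQuantizationConfig(
        quant_mapping={
            # Diffusers-side component: Optimum Quanto, int8 weights.
            "transformer": QuantoConfig(weights_dtype="int8"),
            # Transformers-side component: bitsandbytes, 4-bit weights.
            "text_encoder_2": TransformersBitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            ),
        }
    )
    return DiffusionPipeline.from_pretrained(
        model_id,  # assumed checkpoint
        quantization_config=pipeline_quant_config,
        torch_dtype=torch.bfloat16,
    )
```

Components not named in the mapping are loaded without quantization, so the mapping can be limited to the largest modules.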
