Environment:Huggingface Diffusers Quantization Environment
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Optimization |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Quantization backend environment for Diffusers, covering BitsAndBytes (NF4/INT8), TorchAO, Optimum Quanto, GGUF, and NVIDIA ModelOpt. Every backend except GGUF requires a CUDA or XPU GPU.
Description
This environment provides the quantization backends that Diffusers supports for reducing model memory footprint. The library uses a unified DiffusersAutoQuantizer that dispatches to the appropriate backend based on the configuration class. BitsAndBytes is the most mature backend for 4-bit (NF4) and 8-bit (INT8) quantization. TorchAO provides PyTorch-native quantization; its extended dtype support requires PyTorch >= 2.5. GGUF enables loading pre-quantized models from the llama.cpp ecosystem. All quantization backends except GGUF require a CUDA or XPU GPU; CPU-only quantization is not supported.
Usage
Required for the Model Quantization workflow and for any pipeline that loads quantized models via the `quantization_config` parameter. Use it when GPU memory is limited and a large model (e.g., Flux at 12B parameters) has to fit on a consumer GPU.
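As a rough illustration of the `quantization_config` path, the sketch below loads only the Flux transformer in NF4 through the BitsAndBytes backend. The checkpoint ID `black-forest-labs/FLUX.1-dev`, the bfloat16 compute dtype, and the helper names `nf4_settings` and `load_nf4_transformer` are assumptions made for this example. The load itself needs a CUDA or XPU GPU, access to the checkpoint, and the BitsAndBytes packages listed under Dependencies.

```python
def nf4_settings():
    """Keyword arguments for a 4-bit NF4 BitsAndBytesConfig (illustrative values)."""
    return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}


def load_nf4_transformer(model_id="black-forest-labs/FLUX.1-dev"):
    """Load the transformer of a Flux-style checkpoint with 4-bit NF4 weights."""
    # Imports are deferred so this module can be imported without a GPU stack.
    import torch
    from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
        **nf4_settings(),
    )
    # Passing quantization_config asks from_pretrained to quantize while loading.
    return FluxTransformer2DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )


# Example call (needs a GPU and model access, so it is left commented out):
# transformer = load_nf4_transformer()
```

Quantizing only the transformer is a common choice because it usually dominates the memory budget; other components can be handled separately.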
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | BitsAndBytes CUDA support is Linux-first |
| Hardware | NVIDIA GPU (CUDA) or Intel XPU | Required; without one, the BitsAndBytes quantizer raises `RuntimeError` |
| VRAM | 8 GB+ | 4-bit quantization reduces weight size by ~75% relative to 16-bit weights |
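The ~75% figure follows from bit widths alone if 16-bit weights are taken as the baseline (an assumption here). The short calculation below applies that to a 12B-parameter model; it counts weight storage only and ignores quantization metadata, activations, and other pipeline components, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


params = 12e9  # a Flux-sized model
fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
nf4_gb = weight_memory_gb(params, 4)
reduction = 1 - nf4_gb / fp16_gb

print(f"16-bit: {fp16_gb:.0f} GB, 8-bit: {int8_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB")
print(f"4-bit reduction vs 16-bit: {reduction:.0%}")
```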
Dependencies
Backend-Specific Packages
BitsAndBytes (NF4/INT8):
- `bitsandbytes` >= 0.43.3
- `accelerate` >= 0.26.0
TorchAO:
- `torchao` >= 0.7.0
- `torch` >= 2.5.0 (for extended dtype support)
- `torch` >= 2.6.0 (for safe globals in serialization)
Optimum Quanto:
- `optimum_quanto` >= 0.2.6
GGUF:
- `gguf` >= 0.10.0
NVIDIA ModelOpt:
- `nvidia_modelopt[hf]` >= 0.33.1
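A quick way to check these minimums before loading a model is to compare installed distribution versions against the list above. The sketch below uses only the standard library. The `MINIMUMS` table, the distribution names (assumed to match PyPI, e.g. `optimum-quanto`), and the simplified version parser are assumptions for illustration; the parser reads only the leading numeric components and is not a full PEP 440 implementation.

```python
from importlib import metadata
from itertools import takewhile

# Minimum versions copied from the dependency lists; keys are distribution names.
MINIMUMS = {
    "bitsandbytes": {"bitsandbytes": "0.43.3", "accelerate": "0.26.0"},
    "torchao": {"torchao": "0.7.0", "torch": "2.5.0"},
    "quanto": {"optimum-quanto": "0.2.6"},
    "gguf": {"gguf": "0.10.0"},
    "modelopt": {"nvidia-modelopt": "0.33.1"},
}


def parse(version):
    """Turn '2.5.1+cu121' into (2, 5, 1); stop at the first non-numeric piece."""
    numbers = []
    for piece in version.split("+")[0].split("."):
        lead = "".join(takewhile(str.isdigit, piece))
        if not lead:
            break
        numbers.append(int(lead))
    return tuple(numbers)


def meets(installed, minimum):
    """True if `installed` is at least `minimum`, padding shorter versions with zeros."""
    a, b = parse(installed), parse(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def missing_requirements(backend, installed):
    """List problems for `backend`, given {distribution: version string or None}."""
    problems = []
    for dist, minimum in MINIMUMS[backend].items():
        have = installed.get(dist)
        if have is None:
            problems.append(f"{dist} is not installed (need >= {minimum})")
        elif not meets(have, minimum):
            problems.append(f"{dist} {have} is older than {minimum}")
    return problems


def installed_versions(backend):
    """Look up the installed version of each distribution the backend needs."""
    found = {}
    for dist in MINIMUMS[backend]:
        try:
            found[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            found[dist] = None
    return found


for backend in MINIMUMS:
    issues = missing_requirements(backend, installed_versions(backend))
    print(backend, "ok" if not issues else issues)
```

Keeping `missing_requirements` independent of the lookup makes the comparison logic easy to test without installing any backend.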
Credentials
No additional credentials required beyond the base environment.
Quick Install
# BitsAndBytes quantization (most common)
pip install "diffusers[bitsandbytes]" transformers accelerate
# TorchAO quantization
pip install "diffusers[torchao]" transformers accelerate
# GGUF support
pip install "diffusers[gguf]" transformers accelerate
# All quantization backends
pip install diffusers transformers accelerate bitsandbytes torchao optimum-quanto gguf "nvidia-modelopt[hf]"
Code Evidence
GPU requirement validation from `bnb_quantizer.py:63-73`:
def validate_environment(self, *args, **kwargs):
    if not (torch.cuda.is_available() or torch.xpu.is_available()):
        raise RuntimeError("No GPU found. A GPU is needed for quantization.")
    if not is_accelerate_available() or is_accelerate_version("<", "0.26.0"):
        raise ImportError(
            "Using `bitsandbytes` 4-bit quantization requires Accelerate: "
            "`pip install 'accelerate>=0.26.0'`"
        )
    if not is_bitsandbytes_available() or is_bitsandbytes_version("<", "0.43.3"):
        raise ImportError(
            "Using `bitsandbytes` 4-bit quantization requires the latest version "
            "of bitsandbytes: `pip install -U bitsandbytes`"
        )
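The GPU check above can be mirrored in a standalone preflight step, so a script fails early with the same message instead of partway through model loading. This is a minimal sketch, not library code: `detect_accelerator` and `require_accelerator` are hypothetical helper names, and the defensive `getattr` lookup assumes that `torch.xpu` may be missing on older PyTorch builds.

```python
def detect_accelerator():
    """Return 'cuda', 'xpu', or None, following the same order as the check above."""
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():
        return "cuda"
    # torch.xpu may be absent on older PyTorch builds, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return None


def require_accelerator(device):
    """Raise the same error as the quantizer when no supported device was found."""
    if device is None:
        raise RuntimeError("No GPU found. A GPU is needed for quantization.")
    return device


print("accelerator:", detect_accelerator())
# In a real script: device = require_accelerator(detect_accelerator())
```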
TorchAO PyTorch version gates from `torchao_quantizer.py:50-65`:
# PyTorch >= 2.5 for extended dtypes
_TORCHAO_SUPPORT_EXTENDED_DTYPES = is_torch_version(">=", "2.5")
# PyTorch >= 2.6.0 for safe globals serialization
if is_torch_version(">=", "2.6.0"):
    torch.serialization.add_safe_globals([...])
GGUF CUDA kernel environment variable from `quantizers/gguf/utils.py:33`:
DIFFUSERS_GGUF_CUDA_KERNELS = os.getenv("DIFFUSERS_GGUF_CUDA_KERNELS", "false")
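Because the assignment above sits at module level, the variable is presumably read once, when that module is first imported. Setting it before importing `diffusers` is therefore the safe order. The helper below is a hypothetical sketch of that ordering; it checks for `diffusers` in `sys.modules`, which is a conservative proxy, since the GGUF utilities may be imported later than the top-level package.

```python
import os
import sys


def enable_gguf_cuda_kernels(env=os.environ, loaded_modules=sys.modules):
    """Set the flag; return True if diffusers had not been imported yet."""
    already_imported = "diffusers" in loaded_modules
    env["DIFFUSERS_GGUF_CUDA_KERNELS"] = "true"
    return not already_imported


set_in_time = enable_gguf_cuda_kernels()
if not set_in_time:
    print("diffusers was already imported; restart the process so the flag is picked up")
```

Exporting the variable in the shell before starting Python (`DIFFUSERS_GGUF_CUDA_KERNELS=true python script.py`) avoids the ordering question entirely.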
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: No GPU found. A GPU is needed for quantization.` | No CUDA/XPU GPU available | Use a machine with an NVIDIA or Intel GPU |
| `ImportError: Using bitsandbytes 4-bit quantization requires Accelerate: pip install 'accelerate>=0.26.0'` | accelerate missing or older than 0.26.0 | `pip install 'accelerate>=0.26.0'` |
| `ImportError: Using bitsandbytes 4-bit quantization requires the latest version of bitsandbytes` | bitsandbytes missing or older than 0.43.3 | `pip install -U bitsandbytes` |
| `Converting into 4-bit weights from flax weights is currently not supported` | Attempting to quantize Flax model | Convert to PyTorch format first |
Compatibility Notes
- BitsAndBytes: Linux-first; Windows support is experimental. Requires a CUDA or XPU GPU.
- TorchAO: PyTorch-native; broadest dtype support with PyTorch >= 2.5. Pre-quantized model loading requires PyTorch >= 2.5.0.
- GGUF: Can load pre-quantized GGUF files. Optional CUDA kernel acceleration via `DIFFUSERS_GGUF_CUDA_KERNELS=true`.
- Optimum Quanto: PyTorch quantization backend from Hugging Face Optimum.
- Pipeline-level quantization: Use `PipelineQuantizationConfig` to quantize different components with different backends.
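For the pipeline-level case, the sketch below maps two components to different backends through `quant_mapping`. The component names (`transformer`, `text_encoder_2`) assume a Flux-style pipeline, and the helper name and dtype choices are illustrative. It also assumes that text encoders, being `transformers` models, take the `transformers` version of `BitsAndBytesConfig`; check this against the Diffusers version in use.

```python
def build_pipeline_quant_config(compute_dtype_name="bfloat16"):
    """Build a per-component quantization config (Quanto INT8 + BitsAndBytes NF4)."""
    # Deferred imports: these need diffusers, transformers, and both backend packages.
    import torch
    from diffusers.quantizers import PipelineQuantizationConfig
    from diffusers.quantizers.quantization_config import QuantoConfig
    from transformers import BitsAndBytesConfig as TransformersBnbConfig

    compute_dtype = getattr(torch, compute_dtype_name)
    return PipelineQuantizationConfig(
        quant_mapping={
            # Diffusers model: quantized with a Diffusers config class.
            "transformer": QuantoConfig(weights_dtype="int8"),
            # Transformers model: quantized with the transformers config class.
            "text_encoder_2": TransformersBnbConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=compute_dtype,
            ),
        }
    )


# Example use (needs a GPU and model access, so it is left commented out):
# from diffusers import DiffusionPipeline
# pipe = DiffusionPipeline.from_pretrained(
#     "black-forest-labs/FLUX.1-dev",  # assumed checkpoint
#     quantization_config=build_pipeline_quant_config(),
# )
```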
Related Pages
- Implementation:Huggingface_Diffusers_DiffusersAutoQuantizer_From_Config
- Implementation:Huggingface_Diffusers_Quantization_Config_Classes
- Implementation:Huggingface_Diffusers_ModelMixin_From_Pretrained_Quantized
- Implementation:Huggingface_Diffusers_PipelineQuantizationConfig
- Implementation:Huggingface_Diffusers_Quantized_Pipeline_Call
- Implementation:Huggingface_Diffusers_Save_Pretrained_Quantized