Implementation:Huggingface Diffusers Quantization Config Classes

Metadata

Property	Value
API	`BitsAndBytesConfig(...)`, `TorchAoConfig(...)`, `QuantoConfig(...)`, `GGUFQuantizationConfig(...)`
Module	`src/diffusers/quantizers/quantization_config.py`
Lines	L70-L1051
Import	`from diffusers import BitsAndBytesConfig, TorchAoConfig, QuantoConfig, GGUFQuantizationConfig`
Type	API Doc
Principle	Huggingface_Diffusers_Quantization_Configuration
Implements	Principle:Huggingface_Diffusers_Quantization_Configuration

Purpose

This module defines all quantization configuration dataclasses used in Diffusers. Each config class encapsulates the parameters for a specific quantization backend, provides validation in post_init(), and implements serialization/deserialization for round-trip persistence in model config.json files.

QuantizationConfigMixin (Base Class)

I/O Contract

Attribute	Type	Description
`quant_method`	`QuantizationMethod`	Enum identifying the backend

Key Methods

Method	Signature	Description
`from_dict`	`(config_dict, return_unused_kwargs=False, **kwargs) -> Self`	Class method; instantiate from a dictionary, optionally also returning unused kwargs
`to_dict`	`() -> dict[str, Any]`	Serialize to a dictionary
`to_json_string`	`(use_diff=True) -> str`	Serialize to JSON string
`to_json_file`	`(json_file_path) -> None`	Write config to a JSON file
`update`	`(**kwargs) -> dict[str, Any]`	Update attributes, return unused kwargs
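
The contract in the table above can be illustrated with a small stand-in class. This is a minimal sketch, not the Diffusers implementation: `SketchConfig` and its `bits` field are hypothetical, and the handling of extra keyword arguments in `from_dict` is an assumption about how overrides might be applied.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a quantization config; not the Diffusers class."""

    quant_method: str = "sketch"
    bits: int = 8

    @classmethod
    def from_dict(cls, config_dict: dict, return_unused_kwargs: bool = False, **kwargs):
        # Build the config from the dict, then apply keyword overrides (assumed behavior).
        config = cls(**config_dict)
        unused = config.update(**kwargs)
        return (config, unused) if return_unused_kwargs else config

    def update(self, **kwargs) -> dict:
        # Set attributes that already exist; collect everything else as unused.
        unused: dict[str, Any] = {}
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                unused[key] = value
        return unused

    def to_dict(self) -> dict:
        # Shallow copy so callers cannot mutate the config through the result.
        return dict(self.__dict__)

    def to_json_string(self) -> str:
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


config, unused = SketchConfig.from_dict(
    {"quant_method": "sketch", "bits": 4}, return_unused_kwargs=True, device="cuda"
)
restored = SketchConfig.from_dict(config.to_dict())
```

Here `unused` holds the `device` entry because `SketchConfig` has no such attribute, and `restored` compares equal to `config`.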

BitsAndBytesConfig

Constructor Signature

BitsAndBytesConfig(
    load_in_8bit: bool = False,
    load_in_4bit: bool = False,
    llm_int8_threshold: float = 6.0,
    llm_int8_skip_modules: list[str] | None = None,
    llm_int8_enable_fp32_cpu_offload: bool = False,
    llm_int8_has_fp16_weight: bool = False,
    bnb_4bit_compute_dtype: torch.dtype | str | None = None,  # defaults to torch.float32
    bnb_4bit_quant_type: str = "fp4",                         # "fp4" or "nf4"
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: torch.dtype | str | None = None,  # defaults to torch.uint8
)

Parameters

Parameter	Type	Default	Description
`load_in_8bit`	`bool`	`False`	Enable 8-bit quantization with LLM.int8()
`load_in_4bit`	`bool`	`False`	Enable 4-bit quantization with FP4/NF4 layers
`llm_int8_threshold`	`float`	`6.0`	Outlier threshold for mixed-precision decomposition in 8-bit mode
`llm_int8_skip_modules`	`list[str] | None`	`None`	Module names to keep in original dtype for 8-bit mode
`bnb_4bit_compute_dtype`	`torch.dtype | str | None`	`torch.float32`	Computation dtype for 4-bit dequantized operations
`bnb_4bit_quant_type`	`str`	`"fp4"`	Quantization data type: `"fp4"` or `"nf4"`
`bnb_4bit_use_double_quant`	`bool`	`False`	Enable nested quantization of quantization constants
`bnb_4bit_quant_storage`	`torch.dtype | str | None`	`torch.uint8`	Storage dtype for packed 4-bit parameters

Validation Rules

load_in_4bit and load_in_8bit are mutually exclusive
4-bit quantization requires bitsandbytes >= 0.39.0
bnb_4bit_compute_dtype must be a valid torch.dtype
bnb_4bit_quant_type must be a string
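
The rules above could be checked roughly as follows. This is a sketch under stated assumptions: `check_bnb_flags` is a hypothetical helper, the version is passed in as a plain tuple instead of being read from the installed package, and the exception types are guesses, not confirmed library behavior.

```python
def check_bnb_flags(
    load_in_4bit: bool,
    load_in_8bit: bool,
    quant_type: object = "fp4",
    bnb_version: tuple = (0, 39, 0),
) -> None:
    """Hypothetical helper mirroring three of the validation rules listed above."""
    # Rule: the two loading modes are mutually exclusive.
    if load_in_4bit and load_in_8bit:
        raise ValueError("load_in_4bit and load_in_8bit cannot both be True")
    # Rule: 4-bit mode needs bitsandbytes >= 0.39.0 (version supplied by the caller here).
    if load_in_4bit and bnb_version < (0, 39, 0):
        raise ValueError("4-bit quantization requires bitsandbytes >= 0.39.0")
    # Rule: the quant type must be a string (exception type is an assumption).
    if not isinstance(quant_type, str):
        raise TypeError("bnb_4bit_quant_type must be a string")


# A valid NF4 configuration passes silently.
check_bnb_flags(load_in_4bit=True, load_in_8bit=False, quant_type="nf4")
```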

Usage Example

from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
import torch

# NF4 quantization with bfloat16 compute and double quantization
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=config,
    torch_dtype=torch.bfloat16,
)

TorchAoConfig

Constructor Signature

TorchAoConfig(
    quant_type: str | AOBaseConfig,
    modules_to_not_convert: list[str] | None = None,
    **kwargs,  # forwarded as quant_type_kwargs (e.g., group_size, inner_k_tiles)
)

Parameters

Parameter	Type	Default	Description
`quant_type`	`str | AOBaseConfig`	(required)	Quantization method name or config object
`modules_to_not_convert`	`list[str] | None`	`None`	Modules to skip during quantization
`**kwargs`	`dict`	`{}`	Backend-specific keyword arguments (stored as `quant_type_kwargs`)

Supported String Shorthands

Category	Shorthands
Integer 4-bit	`int4wo`, `int4_weight_only`, `int4dq`, `int8_dynamic_activation_int4_weight`
Integer 8-bit	`int8wo`, `int8_weight_only`, `int8dq`, `int8_dynamic_activation_int8_weight`
Float8	`float8wo`, `float8wo_e5m2`, `float8wo_e4m3`, `float8dq`, `float8dq_e4m3`
Unsigned int	`uint1wo` through `uint7wo`

Validation Rules

If quant_type is a string, it must exist in the supported types registry
If quant_type is an AOBaseConfig instance, requires torchao > 0.9.0
Float8 types require CUDA capability >= 8.9
Unsupported kwargs for a given quant_type raise ValueError
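
One way to picture the registry lookup and keyword check for string shorthands is the sketch below. Everything in it is illustrative: the two stand-in functions, their defaults, and the `SUPPORTED_QUANT_TYPES` mapping are hypothetical, and the real registry covers many more shorthands.

```python
import inspect


# Hypothetical stand-ins for backend quantization functions; names and
# defaults are illustrative only.
def int4_weight_only(group_size=128, inner_k_tiles=8):
    return {"bits": 4, "group_size": group_size, "inner_k_tiles": inner_k_tiles}


def int8_weight_only(group_size=None):
    return {"bits": 8, "group_size": group_size}


# Several shorthands may map to the same function.
SUPPORTED_QUANT_TYPES = {
    "int4wo": int4_weight_only,
    "int4_weight_only": int4_weight_only,
    "int8wo": int8_weight_only,
    "int8_weight_only": int8_weight_only,
}


def validate_quant_type(quant_type: str, **quant_type_kwargs):
    # Rule: a string quant_type must exist in the registry.
    if quant_type not in SUPPORTED_QUANT_TYPES:
        raise ValueError(f"Unsupported quant_type: {quant_type!r}")
    method = SUPPORTED_QUANT_TYPES[quant_type]
    # Rule: every kwarg must be a parameter of the selected function.
    accepted = set(inspect.signature(method).parameters)
    unsupported = sorted(set(quant_type_kwargs) - accepted)
    if unsupported:
        raise ValueError(f"{quant_type!r} does not accept: {unsupported}")
    return method, quant_type_kwargs


method, kwargs = validate_quant_type("int4wo", group_size=64)
settings = method(**kwargs)
```

Comparing the kwargs against the function signature keeps the check in sync with the backend without maintaining a separate list of allowed options per shorthand.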

Usage Example

from diffusers import FluxTransformer2DModel, TorchAoConfig
import torch

# String-based config
quantization_config = TorchAoConfig("int8wo")

# AOBaseConfig-based config (torchao > 0.9.0)
from torchao.quantization import Int8WeightOnlyConfig
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

QuantoConfig

Constructor Signature

QuantoConfig(
    weights_dtype: str = "int8",
    modules_to_not_convert: list[str] | None = None,
)

Parameters

Parameter	Type	Default	Description
`weights_dtype`	`str`	`"int8"`	Target weight dtype. One of: `"float8"`, `"int8"`, `"int4"`, `"int2"`
`modules_to_not_convert`	`list[str] | None`	`None`	Modules to skip during quantization

Validation Rules

weights_dtype must be one of ["float8", "int8", "int4", "int2"]
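
The membership rule above amounts to a single check. The helper name `check_weights_dtype` is hypothetical, and the exception type is an assumption.

```python
ACCEPTED_WEIGHTS = ["float8", "int8", "int4", "int2"]


def check_weights_dtype(weights_dtype: str) -> str:
    """Hypothetical helper: reject any dtype name outside the accepted list."""
    if weights_dtype not in ACCEPTED_WEIGHTS:
        raise ValueError(
            f"Only {ACCEPTED_WEIGHTS} are supported, got {weights_dtype!r}"
        )
    return weights_dtype


checked = check_weights_dtype("int4")
```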

Usage Example

from diffusers import FluxTransformer2DModel, QuantoConfig
import torch

config = QuantoConfig(weights_dtype="int4")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=config,
    torch_dtype=torch.bfloat16,
)

GGUFQuantizationConfig

Constructor Signature

GGUFQuantizationConfig(
    compute_dtype: torch.dtype | None = None,  # defaults to torch.float32
)

Parameters

Parameter	Type	Default	Description
`compute_dtype`	`torch.dtype | None`	`torch.float32`	Computation dtype for dequantized operations

Notes

Always sets pre_quantized = True -- GGUF only supports loading pre-quantized checkpoints
modules_to_not_convert is initialized to None internally

NVIDIAModelOptConfig

Constructor Signature

NVIDIAModelOptConfig(
    quant_type: str,                                    # e.g., "FP8", "INT8", "NF4", "FP8_INT8"
    modules_to_not_convert: list[str] | None = None,
    weight_only: bool = True,
    channel_quantize: int | None = None,
    block_quantize: int | None = None,
    scale_channel_quantize: int | None = None,
    scale_block_quantize: int | None = None,
    algorithm: str = "max",
    forward_loop: Callable | None = None,
    modelopt_config: dict | None = None,
    disable_conv_quantization: bool = False,
)

Parameters

Parameter	Type	Default	Description
`quant_type`	`str`	(required)	Format: `"WEIGHT"` or `"WEIGHT_ACTIVATION"` (e.g., `"FP8"`, `"FP8_INT8"`)
`weight_only`	`bool`	`True`	If True, only quantize weights (disable activation quantizers)
`algorithm`	`str`	`"max"`	Calibration algorithm
`forward_loop`	`Callable | None`	`None`	Forward loop function for calibration

Supported Quant Types

Type	Num Bits	Description
`FP8`	(4, 3)	8-bit floating point
`INT8`	8	8-bit integer
`INT4`	4	4-bit integer
`NF4`	4	NormalFloat 4-bit (with 8-bit scales)
`NVFP4`	(2, 1)	NVIDIA 4-bit float (with FP8 scales)
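
The `"WEIGHT"` / `"WEIGHT_ACTIVATION"` format can be parsed by splitting on the underscore. The sketch below is hypothetical: `split_quant_type` is not a library function, and the fallback choices (weight type `"FP8"`, no activation type) are assumptions made for illustration.

```python
SUPPORTED_TYPES = {"FP8", "INT8", "INT4", "NF4", "NVFP4"}


def split_quant_type(quant_type: str, default_weight: str = "FP8"):
    """Hypothetical parser returning (weight_type, activation_type)."""
    parts = quant_type.split("_")
    weight = parts[0]
    activation = parts[1] if len(parts) > 1 else None
    if len(parts) > 2 or weight not in SUPPORTED_TYPES:
        # Assumed fallback: unknown or malformed input resets to the default weight type.
        return default_weight, None
    if activation is not None and activation not in SUPPORTED_TYPES:
        # Assumed fallback: drop an unrecognized activation type.
        activation = None
    return weight, activation


weight_type, activation_type = split_quant_type("FP8_INT8")
```

With this sketch, `"NF4"` yields a weight type and no activation type, while `"FP8_INT8"` yields both.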

Serialization Round-Trip

All config classes support full round-trip serialization:

# Serialize to dict (stored in config.json)
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
config_dict = config.to_dict()
# {"quant_method": "bitsandbytes", "load_in_4bit": true, "bnb_4bit_quant_type": "nf4", ...}

# Deserialize back
restored = BitsAndBytesConfig.from_dict(config_dict)

TorchAO handles AOBaseConfig serialization through torchao.core.config.config_to_dict and config_from_dict, wrapping the config in a {"default": {...}} dict structure.

Implementation Notes

QuantizationMethod enum: Uses str, Enum inheritance so that values are plain strings (e.g., "bitsandbytes", "torchao") for JSON compatibility.
BnB dtype coercion: BitsAndBytesConfig accepts string dtype names (e.g., "bfloat16") and converts them to torch.dtype via getattr(torch, name).
TorchAO quant_type_kwargs: Extra kwargs passed to TorchAoConfig.__init__ are stored as quant_type_kwargs and later passed to the TorchAO quantization function.
ModelOpt quant_type normalization: The _normalize_quant_type method splits the string on "_" to extract weight and activation types, falling back to defaults for unsupported values.
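
The `str, Enum` note above can be demonstrated with the standard library alone. `MethodSketch` is a hypothetical two-member enum, not the real `QuantizationMethod`; only the string values come from the text above.

```python
import json
from enum import Enum


class MethodSketch(str, Enum):
    """Hypothetical enum; members are also plain strings."""

    BITS_AND_BYTES = "bitsandbytes"
    TORCHAO = "torchao"


# Because members subclass str, json.dumps writes the value directly.
payload = json.dumps({"quant_method": MethodSketch.TORCHAO})

# Reading back: look the member up by its string value.
restored_method = MethodSketch(json.loads(payload)["quant_method"])
```

A plain `Enum` without the `str` mixin would make `json.dumps` raise a `TypeError`, which is why the mixin matters for config files.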

Related Pages

Huggingface_Diffusers_Quantization_Configuration - Theoretical foundation of quantization parameters
Huggingface_Diffusers_DiffusersAutoQuantizer_From_Config - How configs are dispatched to quantizers
Huggingface_Diffusers_ModelMixin_From_Pretrained_Quantized - How configs are consumed during model loading

Requires Environment

Environment:Huggingface_Diffusers_Quantization_Environment

Source References

src/diffusers/quantizers/quantization_config.py:L48-L54 - QuantizationMethod enum
src/diffusers/quantizers/quantization_config.py:L66-L178 - QuantizationConfigMixin
src/diffusers/quantizers/quantization_config.py:L181-L418 - BitsAndBytesConfig
src/diffusers/quantizers/quantization_config.py:L444-L824 - TorchAoConfig
src/diffusers/quantizers/quantization_config.py:L827-L859 - QuantoConfig
src/diffusers/quantizers/quantization_config.py:L421-L441 - GGUFQuantizationConfig
src/diffusers/quantizers/quantization_config.py:L862-L1051 - NVIDIAModelOptConfig
