Purpose
This module defines all quantization configuration dataclasses used in Diffusers. Each config class encapsulates the parameters for a specific quantization backend, provides validation in post_init(), and implements serialization/deserialization for round-trip persistence in model config.json files.
QuantizationConfigMixin (Base Class)
I/O Contract
| Attribute | Type | Description |
|---|---|---|
| quant_method | QuantizationMethod | Enum identifying the backend |
Key Methods
| Method | Signature | Description |
|---|---|---|
| from_dict | cls(config_dict, return_unused_kwargs=False) -> Self | Instantiate from a dictionary |
| to_dict | () -> dict[str, Any] | Serialize to a dictionary |
| to_json_string | (use_diff=True) -> str | Serialize to a JSON string |
| to_json_file | (json_file_path) -> None | Write the config to a JSON file |
| update | (**kwargs) -> dict[str, Any] | Update attributes and return the unused kwargs |
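The contract in the table above can be illustrated with a small stand-in class. This is a minimal sketch, not the Diffusers implementation: `SketchConfig`, `QuantMethod`, the `group_size` field, and the extra `**kwargs` on `from_dict` are assumptions made for illustration. It shows `update` returning the kwargs that match no attribute, and a `str`-based enum value serializing to plain JSON.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any


class QuantMethod(str, Enum):
    # Hypothetical stand-in for QuantizationMethod; inheriting from str
    # keeps the values directly JSON-serializable.
    BITS_AND_BYTES = "bitsandbytes"
    TORCHAO = "torchao"


@dataclass
class SketchConfig:
    # Hypothetical config with a single backend-specific field.
    quant_method: str = QuantMethod.TORCHAO
    group_size: int = 64

    @classmethod
    def from_dict(cls, config_dict: dict[str, Any], return_unused_kwargs: bool = False, **kwargs):
        # Build from the dict, then apply overrides (the **kwargs are an assumption).
        config = cls(**config_dict)
        unused = config.update(**kwargs)
        return (config, unused) if return_unused_kwargs else config

    def to_dict(self) -> dict[str, Any]:
        # Shallow copy of the instance attributes.
        return dict(vars(self))

    def to_json_string(self) -> str:
        return json.dumps(self.to_dict(), indent=2, sort_keys=True)

    def update(self, **kwargs) -> dict[str, Any]:
        # Set attributes that already exist; collect everything else.
        unused = {}
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                unused[key] = value
        return unused


config, unused = SketchConfig.from_dict(
    {"quant_method": "torchao", "group_size": 32},
    return_unused_kwargs=True,
    group_size=128,   # matches an attribute, so it overrides the dict value
    revision="main",  # matches nothing, so it comes back as unused
)
print(config.to_json_string())
print(unused)
```

Returning the unused kwargs lets a caller forward whatever the config did not consume to the next loading step.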
BitsAndBytesConfig
Constructor Signature
BitsAndBytesConfig(
    load_in_8bit: bool = False,
    load_in_4bit: bool = False,
    llm_int8_threshold: float = 6.0,
    llm_int8_skip_modules: list[str] | None = None,
    llm_int8_enable_fp32_cpu_offload: bool = False,
    llm_int8_has_fp16_weight: bool = False,
    bnb_4bit_compute_dtype: torch.dtype | str | None = None,  # defaults to torch.float32
    bnb_4bit_quant_type: str = "fp4",  # "fp4" or "nf4"
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: torch.dtype | str | None = None,  # defaults to torch.uint8
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| load_in_8bit | bool | False | Enable 8-bit quantization with LLM.int8() |
| load_in_4bit | bool | False | Enable 4-bit quantization with FP4/NF4 layers |
| llm_int8_threshold | float | 6.0 | Outlier threshold for mixed-precision decomposition in 8-bit mode |
| llm_int8_skip_modules | list[str] \| None | None | Module names to keep in their original dtype in 8-bit mode |
| bnb_4bit_compute_dtype | torch.dtype \| str \| None | None (falls back to torch.float32) | Computation dtype for 4-bit dequantized operations |
| bnb_4bit_quant_type | str | "fp4" | Quantization data type: "fp4" or "nf4" |
| bnb_4bit_use_double_quant | bool | False | Enable nested quantization of the quantization constants |
| bnb_4bit_quant_storage | torch.dtype \| str \| None | None (falls back to torch.uint8) | Storage dtype for packed 4-bit parameters |
Validation Rules
- load_in_4bit and load_in_8bit are mutually exclusive
- 4-bit quantization requires bitsandbytes >= 0.39.0
- bnb_4bit_compute_dtype must be a valid torch.dtype
- bnb_4bit_quant_type must be a string
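A few of these rules can be sketched as a standalone check. This is an illustration under stated assumptions rather than the library's code: the helper names `parse_version` and `check_bnb_args`, the choice of exception types, and the naive version parser are all hypothetical, and the torch.dtype rule is left out so the sketch stays free of third-party imports.

```python
def parse_version(version: str) -> tuple[int, ...]:
    # Naive parser for plain "X.Y.Z" strings (assumption: no pre-release tags).
    return tuple(int(part) for part in version.split("."))


def check_bnb_args(load_in_4bit: bool, load_in_8bit: bool, quant_type: object, bnb_version: str) -> None:
    """Hypothetical helper mirroring three of the validation rules listed above."""
    if load_in_4bit and load_in_8bit:
        raise ValueError("load_in_4bit and load_in_8bit are mutually exclusive")
    if load_in_4bit and parse_version(bnb_version) < (0, 39, 0):
        raise ValueError("4-bit quantization requires bitsandbytes >= 0.39.0")
    if not isinstance(quant_type, str):
        raise TypeError("bnb_4bit_quant_type must be a string")


# A valid 4-bit setup passes silently.
check_bnb_args(load_in_4bit=True, load_in_8bit=False, quant_type="nf4", bnb_version="0.43.1")

# Requesting both modes at once is rejected.
try:
    check_bnb_args(load_in_4bit=True, load_in_8bit=True, quant_type="nf4", bnb_version="0.43.1")
except ValueError as err:
    print(err)
```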
Usage Example
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
import torch

# NF4 quantization with bfloat16 compute and double quantization
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=config,
    torch_dtype=torch.bfloat16,
)
TorchAoConfig
Constructor Signature
TorchAoConfig(
    quant_type: str | AOBaseConfig,
    modules_to_not_convert: list[str] | None = None,
    **kwargs,  # forwarded as quant_type_kwargs (e.g., group_size, inner_k_tiles)
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| quant_type | str \| AOBaseConfig | (required) | Quantization method name or config object |
| modules_to_not_convert | list[str] \| None | None | Modules to skip during quantization |
| **kwargs | dict | {} | Backend-specific keyword arguments (stored as quant_type_kwargs) |
Supported String Shorthands
| Category | Shorthands |
|---|---|
| Integer 4-bit | int4wo, int4_weight_only, int4dq, int8_dynamic_activation_int4_weight |
| Integer 8-bit | int8wo, int8_weight_only, int8dq, int8_dynamic_activation_int8_weight |
| Float8 | float8wo, float8wo_e5m2, float8wo_e4m3, float8dq, float8dq_e4m3 |
| Unsigned int | uint1wo through uint7wo |
Validation Rules
- If quant_type is a string, it must exist in the supported types registry
- If quant_type is an AOBaseConfig instance, torchao > 0.9.0 is required
- Float8 types require CUDA capability >= 8.9
- Unsupported kwargs for a given quant_type raise ValueError
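The registry lookup and the kwargs check can be sketched with `inspect.signature`. This is a hedged illustration: the `SUPPORTED_TYPES` registry, the two stand-in functions, and `validate_quant_type` are hypothetical names, and the real registry maps shorthands to TorchAO functions whose parameters may differ.

```python
import inspect


def int4_weight_only(group_size=None, inner_k_tiles=None):
    # Hypothetical stand-in for a TorchAO quantization function.
    return {"group_size": group_size, "inner_k_tiles": inner_k_tiles}


def int8_weight_only():
    # Hypothetical stand-in that accepts no extra options.
    return {}


# Hypothetical registry: several shorthands may point at the same function.
SUPPORTED_TYPES = {
    "int4wo": int4_weight_only,
    "int4_weight_only": int4_weight_only,
    "int8wo": int8_weight_only,
    "int8_weight_only": int8_weight_only,
}


def validate_quant_type(quant_type: str, **quant_type_kwargs) -> None:
    """Reject unknown shorthands and kwargs the target function does not accept."""
    if quant_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported quant_type: {quant_type!r}")
    accepted = set(inspect.signature(SUPPORTED_TYPES[quant_type]).parameters)
    unsupported = sorted(set(quant_type_kwargs) - accepted)
    if unsupported:
        raise ValueError(f"{quant_type!r} does not accept kwargs: {unsupported}")


validate_quant_type("int4wo", group_size=64)  # accepted

try:
    validate_quant_type("int8wo", group_size=64)  # int8wo takes no group_size here
except ValueError as err:
    print(err)
```

Deriving the accepted names from the function signature means the check follows the backend automatically instead of relying on a hand-maintained list.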
Usage Example
from diffusers import FluxTransformer2DModel, TorchAoConfig
import torch

# String-based config
quantization_config = TorchAoConfig("int8wo")

# AOBaseConfig-based config (torchao > 0.9.0)
from torchao.quantization import Int8WeightOnlyConfig
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
QuantoConfig
Constructor Signature
QuantoConfig(
    weights_dtype: str = "int8",
    modules_to_not_convert: list[str] | None = None,
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| weights_dtype | str | "int8" | Target weight dtype. One of: "float8", "int8", "int4", "int2" |
| modules_to_not_convert | list[str] \| None | None | Modules to skip during quantization |
Validation Rules
- weights_dtype must be one of ["float8", "int8", "int4", "int2"]
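The membership rule amounts to a one-line check. The sketch below is illustrative only: `check_weights_dtype` and `ACCEPTED_WEIGHTS` are hypothetical names, and raising ValueError is an assumption about how the rule is enforced.

```python
# Accepted values, taken from the validation rule above.
ACCEPTED_WEIGHTS = ("float8", "int8", "int4", "int2")


def check_weights_dtype(weights_dtype: str = "int8") -> str:
    """Hypothetical helper: return the dtype name if accepted, else raise."""
    if weights_dtype not in ACCEPTED_WEIGHTS:
        raise ValueError(
            f"weights_dtype must be one of {list(ACCEPTED_WEIGHTS)}, got {weights_dtype!r}"
        )
    return weights_dtype


print(check_weights_dtype("int4"))

try:
    check_weights_dtype("int3")  # not in the accepted list
except ValueError as err:
    print(err)
```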
Usage Example
from diffusers import FluxTransformer2DModel, QuantoConfig
import torch

config = QuantoConfig(weights_dtype="int4")

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=config,
    torch_dtype=torch.bfloat16,
)
GGUFQuantizationConfig
Constructor Signature
GGUFQuantizationConfig(
    compute_dtype: torch.dtype | None = None,  # defaults to torch.float32
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| compute_dtype | torch.dtype \| None | None (falls back to torch.float32) | Computation dtype for dequantized operations |
Notes
- Always sets pre_quantized = True, because GGUF only supports loading pre-quantized checkpoints
- modules_to_not_convert is initialized to None internally
NVIDIAModelOptConfig
Constructor Signature
NVIDIAModelOptConfig(
    quant_type: str,  # e.g., "FP8", "INT8", "NF4", "FP8_INT8"
    modules_to_not_convert: list[str] | None = None,
    weight_only: bool = True,
    channel_quantize: int | None = None,
    block_quantize: int | None = None,
    scale_channel_quantize: int | None = None,
    scale_block_quantize: int | None = None,
    algorithm: str = "max",
    forward_loop: Callable | None = None,
    modelopt_config: dict | None = None,
    disable_conv_quantization: bool = False,
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| quant_type | str | (required) | Format: "WEIGHT" or "WEIGHT_ACTIVATION" (e.g., "FP8", "FP8_INT8") |
| weight_only | bool | True | If True, only quantize weights (activation quantizers are disabled) |
| algorithm | str | "max" | Calibration algorithm |
| forward_loop | Callable \| None | None | Forward loop function used for calibration |
Supported Quant Types
| Type | Num Bits | Description |
|---|---|---|
| FP8 | (4, 3) | 8-bit floating point |
| INT8 | 8 | 8-bit integer |
| INT4 | 4 | 4-bit integer |
| NF4 | 4 | NormalFloat 4-bit (with 8-bit scales) |
| NVFP4 | (2, 1) | NVIDIA 4-bit float (with FP8 scales) |
Serialization Round-Trip
All config classes support full round-trip serialization:
# Serialize to dict (stored in config.json)
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
config_dict = config.to_dict()
# {"quant_method": "bitsandbytes", "load_in_4bit": true, "bnb_4bit_quant_type": "nf4", ...}
# Deserialize back
restored = BitsAndBytesConfig.from_dict(config_dict)
TorchAO handles AOBaseConfig serialization through torchao.core.config.config_to_dict and config_from_dict, wrapping the config in a {"default": {...}} dict structure.
Implementation Notes
- QuantizationMethod enum: Uses str, Enum inheritance so that values are plain strings (e.g., "bitsandbytes", "torchao") for JSON compatibility.
- BnB dtype coercion: BitsAndBytesConfig accepts string dtype names (e.g., "bfloat16") and converts them to torch.dtype via getattr(torch, name).
- TorchAO quant_type_kwargs: Extra kwargs passed to TorchAoConfig.__init__ are stored as quant_type_kwargs and later passed to the TorchAO quantization function.
- ModelOpt quant_type normalization: The _normalize_quant_type method splits the string on "_" to extract the weight and activation types, falling back to defaults for unsupported values.
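The ModelOpt normalization step can be sketched as follows. This is an approximation under stated assumptions, not the method's actual body: the function name `normalize_quant_type`, the FP8 fallback for an unsupported weight type, and dropping an unsupported activation type are assumed defaults for illustration. The `NUM_BITS` table copies the Supported Quant Types section.

```python
# Supported types and their bit widths, as listed in the Supported Quant Types table.
NUM_BITS = {"FP8": (4, 3), "INT8": 8, "INT4": 4, "NF4": 4, "NVFP4": (2, 1)}


def normalize_quant_type(quant_type: str, default_weight: str = "FP8") -> str:
    """Split "WEIGHT" or "WEIGHT_ACTIVATION" and replace unsupported parts.

    Assumed fallbacks: an unsupported weight type becomes `default_weight`,
    and an unsupported activation type is dropped.
    """
    parts = quant_type.split("_")
    if len(parts) > 2:
        # More than two components is not a recognized format.
        return default_weight
    weight = parts[0]
    activation = parts[1] if len(parts) == 2 else None
    if weight not in NUM_BITS:
        weight = default_weight
    if activation is not None and activation not in NUM_BITS:
        activation = None
    return weight if activation is None else f"{weight}_{activation}"


for name in ("FP8_INT8", "NF4", "BOGUS", "INT8_BOGUS"):
    print(name, "->", normalize_quant_type(name))
```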
Source References
src/diffusers/quantizers/quantization_config.py:L48-L54 - QuantizationMethod enum
src/diffusers/quantizers/quantization_config.py:L66-L178 - QuantizationConfigMixin
src/diffusers/quantizers/quantization_config.py:L181-L418 - BitsAndBytesConfig
src/diffusers/quantizers/quantization_config.py:L444-L824 - TorchAoConfig
src/diffusers/quantizers/quantization_config.py:L827-L859 - QuantoConfig
src/diffusers/quantizers/quantization_config.py:L421-L441 - GGUFQuantizationConfig
src/diffusers/quantizers/quantization_config.py:L862-L1051 - NVIDIAModelOptConfig