Implementation:Huggingface Diffusers Quantization Config Classes

From Leeroopedia

Metadata

| Property | Value |
| --- | --- |
| API | BitsAndBytesConfig(...), TorchAoConfig(...), QuantoConfig(...), GGUFQuantizationConfig(...) |
| Module | src/diffusers/quantizers/quantization_config.py |
| Lines | L70-L1051 |
| Import | from diffusers import BitsAndBytesConfig, TorchAoConfig, QuantoConfig |
| Type | API Doc |
| Principle | Huggingface_Diffusers_Quantization_Configuration |
| Implements | Principle:Huggingface_Diffusers_Quantization_Configuration |

Purpose

This module defines all quantization configuration dataclasses used in Diffusers. Each config class encapsulates the parameters for a specific quantization backend, provides validation in post_init(), and implements serialization/deserialization for round-trip persistence in model config.json files.

QuantizationConfigMixin (Base Class)

I/O Contract

| Attribute | Type | Description |
| --- | --- | --- |
| quant_method | QuantizationMethod | Enum identifying the backend |

Key Methods

| Method | Signature | Description |
| --- | --- | --- |
| from_dict | cls(config_dict, return_unused_kwargs=False) -> Self | Instantiate from a dictionary |
| to_dict | () -> dict[str, Any] | Serialize to a dictionary |
| to_json_string | (use_diff=True) -> str | Serialize to a JSON string |
| to_json_file | (json_file_path) -> None | Write the config to a JSON file |
| update | (**kwargs) -> dict[str, Any] | Update attributes and return the unused kwargs |
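
The method table above can be illustrated with a stdlib-only sketch. SketchConfig is a hypothetical stand-in, not a Diffusers class: it follows the documented method names and return types, but the bodies are assumptions about plausible behavior rather than the library's implementation, and use_diff is omitted.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a quantization config; not a Diffusers class."""

    quant_method: str = "sketch"
    weights_dtype: str = "int8"

    @classmethod
    def from_dict(cls, config_dict: dict[str, Any], return_unused_kwargs: bool = False):
        # Keys that match declared fields build the instance; the rest are "unused".
        known = {k: v for k, v in config_dict.items() if k in cls.__dataclass_fields__}
        unused = {k: v for k, v in config_dict.items() if k not in known}
        config = cls(**known)
        return (config, unused) if return_unused_kwargs else config

    def to_dict(self) -> dict[str, Any]:
        # Shallow copy so callers cannot mutate the config through the result.
        return dict(self.__dict__)

    def to_json_string(self) -> str:
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

    def update(self, **kwargs) -> dict[str, Any]:
        # Set attributes that already exist; hand back everything that did not match.
        unused = {}
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                unused[key] = value
        return unused


config, unused = SketchConfig.from_dict(
    {"weights_dtype": "int4", "extra": 1}, return_unused_kwargs=True
)
leftover = config.update(weights_dtype="int2", not_a_field=True)
```

Returning the unmatched keys from both from_dict and update lets a caller pass one large kwargs dictionary through several consumers without losing track of what each one ignored.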

BitsAndBytesConfig

Constructor Signature

BitsAndBytesConfig(
    load_in_8bit: bool = False,
    load_in_4bit: bool = False,
    llm_int8_threshold: float = 6.0,
    llm_int8_skip_modules: list[str] | None = None,
    llm_int8_enable_fp32_cpu_offload: bool = False,
    llm_int8_has_fp16_weight: bool = False,
    bnb_4bit_compute_dtype: torch.dtype | str | None = None,  # defaults to torch.float32
    bnb_4bit_quant_type: str = "fp4",                         # "fp4" or "nf4"
    bnb_4bit_use_double_quant: bool = False,
    bnb_4bit_quant_storage: torch.dtype | str | None = None,  # defaults to torch.uint8
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| load_in_8bit | bool | False | Enable 8-bit quantization with LLM.int8() |
| load_in_4bit | bool | False | Enable 4-bit quantization with FP4/NF4 layers |
| llm_int8_threshold | float | 6.0 | Outlier threshold for mixed-precision decomposition in 8-bit mode |
| llm_int8_skip_modules | list[str] or None | None | Module names to keep in original dtype for 8-bit mode |
| bnb_4bit_compute_dtype | torch.dtype, str, or None | torch.float32 | Computation dtype for 4-bit dequantized operations |
| bnb_4bit_quant_type | str | "fp4" | Quantization data type: "fp4" or "nf4" |
| bnb_4bit_use_double_quant | bool | False | Enable nested quantization of quantization constants |
| bnb_4bit_quant_storage | torch.dtype, str, or None | torch.uint8 | Storage dtype for packed 4-bit parameters |

Validation Rules

  • load_in_4bit and load_in_8bit are mutually exclusive
  • 4-bit quantization requires bitsandbytes >= 0.39.0
  • bnb_4bit_compute_dtype must be a valid torch.dtype
  • bnb_4bit_quant_type must be a string
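
The rules above can be sketched as a standalone check. validate_bnb_flags is a hypothetical helper, not part of Diffusers; the exception types and the simplified version parsing (three numeric components) are assumptions made for illustration, and the dtype rule is left out to avoid a torch dependency.

```python
def validate_bnb_flags(
    load_in_8bit: bool,
    load_in_4bit: bool,
    quant_type: object,
    bnb_version: str,
) -> None:
    """Hypothetical helper mirroring the listed rules; not Diffusers code."""
    if load_in_4bit and load_in_8bit:
        raise ValueError("load_in_4bit and load_in_8bit are mutually exclusive")
    if not isinstance(quant_type, str):
        raise TypeError("bnb_4bit_quant_type must be a string")
    # Assumes a plain "MAJOR.MINOR.PATCH" version string, e.g. "0.39.0" -> (0, 39, 0).
    installed = tuple(int(part) for part in bnb_version.split(".")[:3])
    if load_in_4bit and installed < (0, 39, 0):
        raise ValueError("4-bit quantization requires bitsandbytes >= 0.39.0")


# A valid 4-bit setup passes silently.
validate_bnb_flags(False, True, "nf4", "0.43.1")
```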

Usage Example

from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
import torch

# NF4 quantization with bfloat16 compute and double quantization
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=config,
    torch_dtype=torch.bfloat16,
)

TorchAoConfig

Constructor Signature

TorchAoConfig(
    quant_type: str | AOBaseConfig,
    modules_to_not_convert: list[str] | None = None,
    **kwargs,  # forwarded as quant_type_kwargs (e.g., group_size, inner_k_tiles)
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| quant_type | str or AOBaseConfig | (required) | Quantization method name or config object |
| modules_to_not_convert | list[str] or None | None | Modules to skip during quantization |
| **kwargs | dict | {} | Backend-specific keyword arguments (stored as quant_type_kwargs) |

Supported String Shorthands

| Category | Shorthands |
| --- | --- |
| Integer 4-bit | int4wo, int4_weight_only, int4dq, int8_dynamic_activation_int4_weight |
| Integer 8-bit | int8wo, int8_weight_only, int8dq, int8_dynamic_activation_int8_weight |
| Float8 | float8wo, float8wo_e5m2, float8wo_e4m3, float8dq, float8dq_e4m3 |
| Unsigned int | uint1wo through uint7wo |

Validation Rules

  • If quant_type is a string, it must exist in the supported types registry
  • If quant_type is an AOBaseConfig instance, requires torchao > 0.9.0
  • Float8 types require CUDA capability >= 8.9
  • Unsupported kwargs for a given quant_type raise ValueError
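
The registry lookup and kwargs check described above can be sketched without torchao installed. The two quantization functions and the SUPPORTED registry below are hypothetical stand-ins; only the shorthand names and the group_size / inner_k_tiles parameter names come from this page. Inspecting the target function's signature is one way to detect unsupported kwargs, and is an assumption about the approach rather than a copy of the library code.

```python
import inspect


def int8_weight_only(group_size=None):
    """Hypothetical stand-in for a torchao quantization function."""
    return {"method": "int8_weight_only", "group_size": group_size}


def int4_weight_only(group_size=128, inner_k_tiles=8):
    """Hypothetical stand-in for a torchao quantization function."""
    return {"method": "int4_weight_only", "group_size": group_size, "inner_k_tiles": inner_k_tiles}


# Hypothetical registry: several shorthands may point at the same function.
SUPPORTED = {
    "int8wo": int8_weight_only,
    "int8_weight_only": int8_weight_only,
    "int4wo": int4_weight_only,
    "int4_weight_only": int4_weight_only,
}


def resolve_quant_type(quant_type: str, **kwargs):
    """Return (function, kwargs) or raise ValueError on an unknown name or kwarg."""
    if quant_type not in SUPPORTED:
        raise ValueError(f"Unsupported quant_type {quant_type!r}; choose from {sorted(SUPPORTED)}")
    method = SUPPORTED[quant_type]
    accepted = set(inspect.signature(method).parameters)
    unsupported = sorted(set(kwargs) - accepted)
    if unsupported:
        raise ValueError(f"{quant_type!r} does not accept keyword arguments: {unsupported}")
    return method, kwargs


method, quant_type_kwargs = resolve_quant_type("int4wo", group_size=64)
```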

Usage Example

from diffusers import FluxTransformer2DModel, TorchAoConfig
import torch

# String-based config
quantization_config = TorchAoConfig("int8wo")

# AOBaseConfig-based config (torchao > 0.9.0)
from torchao.quantization import Int8WeightOnlyConfig
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

QuantoConfig

Constructor Signature

QuantoConfig(
    weights_dtype: str = "int8",
    modules_to_not_convert: list[str] | None = None,
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| weights_dtype | str | "int8" | Target weight dtype. One of: "float8", "int8", "int4", "int2" |
| modules_to_not_convert | list[str] or None | None | Modules to skip during quantization |

Validation Rules

  • weights_dtype must be one of ["float8", "int8", "int4", "int2"]

Usage Example

from diffusers import FluxTransformer2DModel, QuantoConfig
import torch

config = QuantoConfig(weights_dtype="int4")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=config,
    torch_dtype=torch.bfloat16,
)

GGUFQuantizationConfig

Constructor Signature

GGUFQuantizationConfig(
    compute_dtype: torch.dtype | None = None,  # defaults to torch.float32
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| compute_dtype | torch.dtype or None | torch.float32 | Computation dtype for dequantized operations |

Notes

  • Always sets pre_quantized = True, because GGUF only supports loading pre-quantized checkpoints
  • modules_to_not_convert is initialized to None internally

NVIDIAModelOptConfig

Constructor Signature

NVIDIAModelOptConfig(
    quant_type: str,                                    # e.g., "FP8", "INT8", "NF4", "FP8_INT8"
    modules_to_not_convert: list[str] | None = None,
    weight_only: bool = True,
    channel_quantize: int | None = None,
    block_quantize: int | None = None,
    scale_channel_quantize: int | None = None,
    scale_block_quantize: int | None = None,
    algorithm: str = "max",
    forward_loop: Callable | None = None,
    modelopt_config: dict | None = None,
    disable_conv_quantization: bool = False,
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| quant_type | str | (required) | Format: "WEIGHT" or "WEIGHT_ACTIVATION" (e.g., "FP8", "FP8_INT8") |
| weight_only | bool | True | If True, only quantize weights (disable activation quantizers) |
| algorithm | str | "max" | Calibration algorithm |
| forward_loop | Callable or None | None | Forward loop function for calibration |

Supported Quant Types

| Type | Num Bits | Description |
| --- | --- | --- |
| FP8 | (4, 3) | 8-bit floating point |
| INT8 | 8 | 8-bit integer |
| INT4 | 4 | 4-bit integer |
| NF4 | 4 | NormalFloat 4-bit (with 8-bit scales) |
| NVFP4 | (2, 1) | NVIDIA 4-bit float (with FP8 scales) |
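
The "WEIGHT" / "WEIGHT_ACTIVATION" string format and the table above can be combined into a small parser. parse_quant_type is a hypothetical helper, not the library's method; the fallback choices (FP8 for an unknown weight type, no activation quantization for an unknown activation type, extra segments ignored) are assumptions for illustration.

```python
# Bit widths copied from the table above; tuples are (exponent, mantissa) style pairs.
NUM_BITS = {"FP8": (4, 3), "INT8": 8, "INT4": 4, "NF4": 4, "NVFP4": (2, 1)}


def parse_quant_type(quant_type: str, default_weight: str = "FP8"):
    """Split 'WEIGHT' or 'WEIGHT_ACTIVATION' into (weight_type, activation_type)."""
    parts = quant_type.split("_")
    # Assumption: an unknown weight type falls back to default_weight.
    weight = parts[0] if parts[0] in NUM_BITS else default_weight
    # Assumption: a missing or unknown activation type means weight-only quantization.
    activation = parts[1] if len(parts) > 1 and parts[1] in NUM_BITS else None
    return weight, activation


weight, activation = parse_quant_type("FP8_INT8")
weight_bits, activation_bits = NUM_BITS[weight], NUM_BITS[activation]
```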

Serialization Round-Trip

All config classes support full round-trip serialization:

# Serialize to dict (stored in config.json)
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
config_dict = config.to_dict()
# {"quant_method": "bitsandbytes", "load_in_4bit": True, "bnb_4bit_quant_type": "nf4", ...}

# Deserialize back
restored = BitsAndBytesConfig.from_dict(config_dict)

TorchAO handles AOBaseConfig serialization through torchao.core.config.config_to_dict and config_from_dict, wrapping the config in a {"default": {...}} dict structure.

Implementation Notes

  • QuantizationMethod enum: Uses str, Enum inheritance so that values are plain strings (e.g., "bitsandbytes", "torchao") for JSON compatibility.
  • BnB dtype coercion: BitsAndBytesConfig accepts string dtype names (e.g., "bfloat16") and converts them to torch.dtype via getattr(torch, name).
  • TorchAO quant_type_kwargs: Extra kwargs passed to TorchAoConfig.__init__ are stored as quant_type_kwargs and later passed to the TorchAO quantization function.
  • ModelOpt quant_type normalization: The _normalize_quant_type method splits the string on "_" to extract weight and activation types, falling back to defaults for unsupported values.
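
The first note, on str, Enum inheritance, can be checked with the standard library alone. SketchQuantizationMethod is a hypothetical stand-in that reuses two of the string values mentioned above; it is not the Diffusers enum itself.

```python
import json
from enum import Enum


class SketchQuantizationMethod(str, Enum):
    """Hypothetical stand-in; members are real str instances because of the str mixin."""

    BITS_AND_BYTES = "bitsandbytes"
    TORCHAO = "torchao"


# json.dumps treats the member as a plain string, so no custom encoder is needed.
payload = json.dumps({"quant_method": SketchQuantizationMethod.TORCHAO})

# Looking the value up by string returns the original member.
restored = SketchQuantizationMethod(json.loads(payload)["quant_method"])
```

With a plain Enum (no str mixin), the json.dumps call would raise TypeError, which is why the mixin matters for configs that are written to config.json.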

Related Pages

Requires Environment

Source References

  • src/diffusers/quantizers/quantization_config.py:L48-L54 - QuantizationMethod enum
  • src/diffusers/quantizers/quantization_config.py:L66-L178 - QuantizationConfigMixin
  • src/diffusers/quantizers/quantization_config.py:L181-L418 - BitsAndBytesConfig
  • src/diffusers/quantizers/quantization_config.py:L444-L824 - TorchAoConfig
  • src/diffusers/quantizers/quantization_config.py:L827-L859 - QuantoConfig
  • src/diffusers/quantizers/quantization_config.py:L421-L441 - GGUFQuantizationConfig
  • src/diffusers/quantizers/quantization_config.py:L862-L1051 - NVIDIAModelOptConfig
