Principle:Huggingface Diffusers Quantized Model Saving

Overview

Quantized Model Saving covers the serialization of quantized models to disk while preserving the quantization configuration, the quantized weight format, and all metadata needed for correct reloading. Saving must persist not only the quantized weight tensors but also the quantization_config in the model's config.json. With that in place, a later from_pretrained call can detect and load the quantized checkpoint automatically, without the user re-specifying the quantization configuration.

Theoretical Foundation

Quantization-Aware Serialization

Saving a quantized model is conceptually different from saving a full-precision model because:

Weight format differs from standard tensors: Quantized weights may be packed (e.g., two 4-bit values in a single byte), stored as custom tensor types, or accompanied by auxiliary tensors (scales, zero-points).
Metadata must be preserved: The quantization_config (backend type, bit-width, quant type, compute dtype) must be stored so the loader knows how to interpret the weight format.
Not all backends are serializable: Some quantization methods produce models that cannot be meaningfully serialized. The is_serializable property on the quantizer controls this.
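The packing mentioned in the first point can be illustrated with a minimal sketch. It assumes unsigned 4-bit codes and a high-nibble-first layout; both are assumptions made for illustration, and real backends may use a different nibble order, signedness, or block layout. The helper names pack_int4 and unpack_int4 are hypothetical.

```python
def pack_int4(values):
    """Pack a sequence of 4-bit integers (0..15) into bytes, two per byte."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits (0..15)")
    # Assumed layout: first value of each pair in the high nibble, second in the low nibble.
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Recover the 4-bit values from the packed bytes (inverse of pack_int4)."""
    out = []
    for byte in packed:
        out.append(byte >> 4)    # high nibble
        out.append(byte & 0x0F)  # low nibble
    return out


codes = [1, 15, 0, 7]
packed = pack_int4(codes)
assert len(packed) == len(codes) // 2   # half as many bytes as values
assert unpack_int4(packed) == codes     # round trip is lossless
```

A loader that sees only the packed bytes cannot tell them apart from ordinary uint8 data, which is why the accompanying metadata has to be saved as well.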

The Serialization Contract

For a quantized model to be saved, the following conditions must hold:

The quantizer's is_serializable property must return True
The model's state_dict() must return the quantized weight tensors in a format compatible with the serialization backend (safetensors or pickle)
The model's config.json must include a quantization_config section with all parameters needed to reconstruct the quantizer at load time

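The first condition is checked explicitly: if the quantizer reports that it is not serializable, save_pretrained raises a ValueError explaining why the model cannot be saved. The other two conditions are requirements on the backend implementation rather than separate runtime checks.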
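The is_serializable gate can be sketched as follows. ToyQuantizer and check_can_save are hypothetical stand-ins written for this page, not the Diffusers classes; the sketch only shows the shape of the check that runs before any weights are written.

```python
class ToyQuantizer:
    """Hypothetical stand-in for a backend quantizer; not the Diffusers class."""

    def __init__(self, quant_method, serializable):
        self.quant_method = quant_method
        self._serializable = serializable

    @property
    def is_serializable(self):
        return self._serializable


def check_can_save(quantizer):
    """Raise ValueError if a quantizer is attached and declares itself unserializable."""
    if quantizer is None:
        return  # non-quantized model: nothing to check
    if not quantizer.is_serializable:
        raise ValueError(
            f"The model is quantized with {quantizer.quant_method} and is not serializable."
        )


# Both calls pass silently: no quantizer, and a quantizer that allows saving.
check_can_save(None)
check_can_save(ToyQuantizer("bitsandbytes", serializable=True))
```

Running the check before writing any file means a failed save leaves no partial checkpoint on disk.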

Config.json Metadata

When a quantized model is saved, its config.json contains an additional quantization_config field. This field is a JSON-serializable dictionary produced by the config object's to_dict() method. For example:

{
  "_class_name": "FluxTransformer2DModel",
  "_diffusers_version": "0.32.0",
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": false,
    "bnb_4bit_quant_storage": "uint8"
  }
}

This metadata is the bridge between saving and loading: on the next from_pretrained call, the loader detects this config and automatically sets pre_quantized=True.
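The detection step can be sketched in a few lines. This is a simplified illustration under the assumption that the presence of a quantization_config key is what marks a checkpoint as pre-quantized; detect_pre_quantized is a hypothetical helper, not a Diffusers function.

```python
import json
import tempfile
from pathlib import Path


def detect_pre_quantized(model_dir):
    """Return (pre_quantized, quantization_config) based on config.json contents."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_cfg = config.get("quantization_config")
    return quant_cfg is not None, quant_cfg


with tempfile.TemporaryDirectory() as tmp:
    # Write a trimmed-down config.json like the one shown above.
    saved = {
        "_class_name": "FluxTransformer2DModel",
        "quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": True},
    }
    (Path(tmp) / "config.json").write_text(json.dumps(saved))
    pre_quantized, quant_cfg = detect_pre_quantized(tmp)

assert pre_quantized
assert quant_cfg["quant_method"] == "bitsandbytes"
```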

Safetensors Compatibility

Safetensors is the default and recommended serialization format in Diffusers (safe_serialization=True). For quantized models, safetensors compatibility depends on the backend:

BitsAndBytes: Quantized weights are stored as standard uint8 or float16 tensors in safetensors. Auxiliary tensors (absmax, code, quant_state) are stored alongside.
TorchAO: Tensor subclasses need to be serializable to standard tensor types for safetensors compatibility. The state_dict() call handles this conversion.
Quanto: Quantized linear layers store their packed weights and scales as standard tensors.
GGUF: GGUF uses its own file format; saving back to safetensors may not preserve the original GGUF quantization format.

Sharded Checkpoints

Large quantized models support sharded saving via max_shard_size. The weight tensors are split across multiple safetensors files, and an index file (diffusion_pytorch_model.safetensors.index.json for Diffusers models) maps each parameter name to the shard file that contains it. The quantization config is stored in config.json independently of the sharding.
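A minimal sketch of how such an index could be built is shown here. It assumes a simple greedy split over tensor sizes in bytes; build_shard_index, the prefix parameter, and the file-name pattern are illustrative assumptions, and the actual splitting logic in the library may differ.

```python
def build_shard_index(tensor_sizes, max_shard_size, prefix="model"):
    """Greedily assign tensors (name -> size in bytes) to shards of at most max_shard_size.

    A single tensor larger than max_shard_size gets a shard of its own.
    """
    shards = [[]]
    current = 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would overflow a non-empty one.
        if shards[-1] and current + size > max_shard_size:
            shards.append([])
            current = 0
        shards[-1].append(name)
        current += size

    total = len(shards)
    weight_map = {}
    for i, names in enumerate(shards, start=1):
        filename = f"{prefix}-{i:05d}-of-{total:05d}.safetensors"
        for name in names:
            weight_map[name] = filename
    return {"metadata": {"total_size": sum(tensor_sizes.values())}, "weight_map": weight_map}


# Quantized weights and their auxiliary tensors are ordinary state_dict entries here.
sizes = {"block.0.weight": 600, "block.0.weight.absmax": 50, "block.1.weight": 600}
index = build_shard_index(sizes, max_shard_size=1000)
assert len(set(index["weight_map"].values())) == 2
```

Because the index only maps names to files, it carries no quantization information; the loader still relies on config.json to interpret the tensors it finds.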

Pre-quantization Dtype Handling

During loading, the original torch_dtype is stored as _pre_quantization_dtype in the model config. This is a torch.dtype object that is not JSON-serializable. During saving:

save_config() handles the serialization of the config, including the quantization_config dict
The _pre_quantization_dtype entry is removed from the config dictionary before it is written, since a torch.dtype cannot be stored directly in JSON
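One way to handle the non-serializable entry, sketched here, is to drop it from a copy of the config before encoding. FakeDtype is a hypothetical stand-in for torch.dtype so the example runs without PyTorch, and config_to_json is an illustrative helper rather than the library's method.

```python
import json


class FakeDtype:
    """Hypothetical stand-in for torch.dtype: an object json.dumps cannot encode."""


def config_to_json(config):
    """Serialize a config dict, dropping the runtime-only _pre_quantization_dtype entry."""
    serializable = dict(config)  # shallow copy so the live config is left untouched
    serializable.pop("_pre_quantization_dtype", None)
    return json.dumps(serializable, indent=2, sort_keys=True)


config = {
    "_class_name": "FluxTransformer2DModel",
    "quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": True},
    "_pre_quantization_dtype": FakeDtype(),
}
text = config_to_json(config)
assert "_pre_quantization_dtype" not in json.loads(text)
assert "_pre_quantization_dtype" in config  # the in-memory config still has it
```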

Pipeline-Level Saving

When saving a pipeline with DiffusionPipeline.save_pretrained(), each component is saved individually to its own subdirectory. The pipeline iterates over its saveable modules, determines the correct save method for each, and calls it with appropriate arguments. Quantized components are saved by their model-level save_pretrained, which handles the quantization-specific logic. The pipeline's top-level model_index.json is unaffected by quantization.
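The per-component delegation can be sketched as below. ToyComponent and save_pipeline are hypothetical, and the index written here is simplified to a name-to-class mapping; the point is only that the pipeline hands each component its own subdirectory and leaves quantization details to the component's save method.

```python
import json
import tempfile
from pathlib import Path


class ToyComponent:
    """Hypothetical component exposing a save_pretrained(directory) method."""

    def __init__(self, config):
        self.config = config

    def save_pretrained(self, directory):
        Path(directory).mkdir(parents=True, exist_ok=True)
        (Path(directory) / "config.json").write_text(json.dumps(self.config))


def save_pipeline(components, root):
    """Save each component into root/<name>/ and write a simplified top-level index."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    index = {}
    for name, component in components.items():
        component.save_pretrained(root / name)  # component handles its own format
        index[name] = type(component).__name__
    (root / "model_index.json").write_text(json.dumps(index))
    return index


with tempfile.TemporaryDirectory() as tmp:
    parts = {
        "transformer": ToyComponent({"quantization_config": {"quant_method": "bitsandbytes"}}),
        "vae": ToyComponent({}),
    }
    top_index = save_pipeline(parts, tmp)
    saved_cfg = json.loads((Path(tmp) / "transformer" / "config.json").read_text())

# The quantization config lives in the component's config, not in the top-level index.
assert "quantization_config" in saved_cfg
assert "quantization_config" not in top_index
```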

Key Design Decisions

Serialization gate: The is_serializable abstract property forces each backend to explicitly declare whether its quantized models can be saved. This prevents silent data corruption from saving unsupported formats.
Standard save_pretrained API: No special save method is needed for quantized models. The same model.save_pretrained() / pipeline.save_pretrained() API works for both quantized and non-quantized models.
Config-driven reloading: By embedding quantization_config in config.json, saved models are self-describing. Users can load them without knowing the quantization details.
Safetensors as default: Using safetensors (safe_serialization=True) by default ensures safe, efficient, and zero-copy-capable weight storage.
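The decisions above imply a save-and-reload round trip along the following lines. This is a usage sketch, not a tested recipe: it assumes diffusers, torch, and bitsandbytes are installed and that a suitable GPU is available. The function name and its model_id and out_dir arguments are hypothetical, and the caller supplies a repository whose transformer subfolder holds a FluxTransformer2DModel.

```python
def save_and_reload_quantized(model_id, out_dir):
    """Quantize a transformer at load time, save it, and reload it without a config."""
    # Imports are kept inside the function so that defining it needs no heavy dependencies.
    import torch
    from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = FluxTransformer2DModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    # Same API as for a full-precision model; safetensors is the default format.
    model.save_pretrained(out_dir)

    # No quantization_config argument: it is expected to be read from config.json.
    return FluxTransformer2DModel.from_pretrained(out_dir)
```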

Related Pages

Implemented By

Implementation:Huggingface_Diffusers_Save_Pretrained_Quantized

Huggingface_Diffusers_Save_Pretrained_Quantized - Implementation of the save flow
Huggingface_Diffusers_Quantized_Model_Loading - The loading counterpart that reads saved quantized models
Huggingface_Diffusers_Quantization_Configuration - Config objects that are serialized into config.json
Huggingface_Diffusers_Quantized_Inference - Running inference before saving

Source References

src/diffusers/models/modeling_utils.py:L667-L820 - ModelMixin.save_pretrained with quantization checks
src/diffusers/pipelines/pipeline_utils.py:L240-L371 - DiffusionPipeline.save_pretrained
src/diffusers/quantizers/base.py:L234-L236 - is_serializable abstract property
src/diffusers/quantizers/quantization_config.py:L125-L130 - to_dict serialization
