Principle:Huggingface Diffusers Quantized Model Saving
Overview
Quantized Model Saving covers the serialization of quantized models to disk while preserving the quantization configuration, quantized weight format, and all metadata necessary for correct reloading. When a quantized model is saved, it must persist not only the quantized weight tensors but also the quantization_config in the model's config.json, enabling future from_pretrained calls to automatically detect and correctly load the quantized checkpoint without requiring the user to re-specify the quantization configuration.
Theoretical Foundation
Quantization-Aware Serialization
Saving a quantized model is conceptually different from saving a full-precision model because:
- Weight format differs from standard tensors: Quantized weights may be packed (e.g., two 4-bit values in a single byte), stored as custom tensor types, or accompanied by auxiliary tensors (scales, zero-points).
- Metadata must be preserved: The quantization_config (backend type, bit-width, quant type, compute dtype) must be stored so the loader knows how to interpret the weight format.
- Not all backends are serializable: Some quantization methods produce models that cannot be meaningfully serialized. The is_serializable property on the quantizer controls this.
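To make the packed-weight point concrete, here is a minimal pure-Python sketch of storing two 4-bit codes per byte alongside one auxiliary scale. It assumes a plain linear 16-level grid, not the NF4 code book or any backend's real layout, and every function name is hypothetical.

```python
def quantize_4bit(values):
    """Map floats to unsigned 4-bit codes (0..15) using a single absmax scale."""
    absmax = max(abs(v) for v in values) or 1.0
    # Linear grid over [-absmax, absmax]; an illustrative assumption only.
    codes = [min(15, max(0, round((v / absmax + 1.0) * 7.5))) for v in values]
    return codes, absmax


def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into single bytes, high nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even number of codes
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return codes[:count]


def dequantize_4bit(codes, absmax):
    """Invert the linear grid; the result is approximate, not exact."""
    return [(c / 7.5 - 1.0) * absmax for c in codes]


weights = [0.5, -1.0, 0.25, 1.0]
codes, absmax = quantize_4bit(weights)
packed = pack_nibbles(codes)  # 4 weights fit in 2 bytes
restored = dequantize_4bit(unpack_nibbles(packed, len(weights)), absmax)
```

A saved checkpoint needs both `packed` and `absmax`: without the scale, and without metadata naming the grid, the bytes cannot be interpreted.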
The Serialization Contract
For a quantized model to be saved, the following conditions must hold:
- The quantizer's is_serializable property must return True
- The model's state_dict() must return the quantized weight tensors in a format compatible with the serialization backend (safetensors or pickle)
- The model's config.json must include a quantization_config section with all parameters needed to reconstruct the quantizer at load time
In particular, if the quantizer reports that it is not serializable, save_pretrained raises a ValueError explaining why the model cannot be saved.
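The serialization gate can be sketched as follows. The classes and the check_can_save helper are hypothetical stand-ins written for illustration, not the Diffusers quantizer classes; only the idea of an abstract is_serializable property checked before saving comes from the text above.

```python
from abc import ABC, abstractmethod


class SketchQuantizer(ABC):
    """Hypothetical stand-in for a quantizer backend."""

    def __init__(self, quant_method):
        self.quant_method = quant_method

    @property
    @abstractmethod
    def is_serializable(self):
        """Each backend must state explicitly whether saving is supported."""


class SerializableQuantizer(SketchQuantizer):
    @property
    def is_serializable(self):
        return True


class RuntimeOnlyQuantizer(SketchQuantizer):
    @property
    def is_serializable(self):
        return False


def check_can_save(quantizer):
    """Raise before any file is written if the backend opts out of saving."""
    if quantizer is not None and not quantizer.is_serializable:
        raise ValueError(
            f"The model is quantized with {quantizer.quant_method} "
            "and is not serializable."
        )
    return True
```

Because the property is abstract, a backend that forgets to declare it cannot be instantiated at all, so the decision is never left implicit.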
Config.json Metadata
When a quantized model is saved, its config.json contains an additional quantization_config field. This field is a JSON-serializable dictionary produced by the config object's to_dict() method. For example:
{
  "_class_name": "FluxTransformer2DModel",
  "_diffusers_version": "0.32.0",
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": false,
    "bnb_4bit_quant_storage": "uint8"
  }
}
This metadata is the bridge between saving and loading: on the next from_pretrained call, the loader detects this config and automatically sets pre_quantized=True.
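The round trip through config.json can be illustrated with a small sketch. SketchQuantConfig, build_config_json, and detect_pre_quantized are hypothetical names; the real loader logic is more involved, and this only shows how the presence of the quantization_config key can drive the pre-quantized decision.

```python
import json


class SketchQuantConfig:
    """Hypothetical config object; field names mirror the example above."""

    def __init__(self, quant_method, **params):
        self.quant_method = quant_method
        self.params = params

    def to_dict(self):
        # Only JSON-friendly values (str, bool, int, ...) belong here.
        return {"quant_method": self.quant_method, **self.params}


def build_config_json(class_name, quant_config=None):
    """Serialize a model config, embedding the quantization section if present."""
    config = {"_class_name": class_name}
    if quant_config is not None:
        config["quantization_config"] = quant_config.to_dict()
    return json.dumps(config, indent=2, sort_keys=True)


def detect_pre_quantized(config_text):
    """Return (pre_quantized, quantization_dict) from the text of a config.json."""
    config = json.loads(config_text)
    quant = config.get("quantization_config")
    return quant is not None, quant


saved = build_config_json(
    "FluxTransformer2DModel",
    SketchQuantConfig("bitsandbytes", load_in_4bit=True, bnb_4bit_quant_type="nf4"),
)
pre_quantized, quant = detect_pre_quantized(saved)
```

A config written without the quantization section yields `(False, None)`, which is how a full-precision checkpoint would be told apart in this sketch.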
Safetensors Compatibility
Safetensors is the default and recommended serialization format in Diffusers (safe_serialization=True). For quantized models, safetensors compatibility depends on the backend:
- BitsAndBytes: Quantized weights are stored as standard uint8 or float16 tensors in safetensors. Auxiliary tensors (absmax, code, quant_state) are stored alongside.
- TorchAO: Tensor subclasses need to be serializable to standard tensor types for safetensors compatibility. The state_dict() call handles this conversion.
- Quanto: Quantized linear layers store their packed weights and scales as standard tensors.
- GGUF: GGUF uses its own file format; saving back to safetensors may not preserve the original GGUF quantization format.
Sharded Checkpoints
Large quantized models support sharded saving via max_shard_size. The weight tensors are split across multiple safetensors files, with an index file (diffusion_pytorch_model.safetensors.index.json for Diffusers models) mapping parameter names to shard files. The quantization config is stored in config.json independently of the sharding.
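A rough sketch of how parameters might be grouped into shards and indexed is shown below. The greedy strategy, the `prefix` argument, and the file-name pattern are assumptions made for illustration; the library chooses the actual names and splitting rules.

```python
def shard_state_dict(tensor_sizes, max_shard_size, prefix="model"):
    """Greedily group parameters into shards of at most max_shard_size bytes.

    tensor_sizes maps parameter name -> size in bytes. A single tensor larger
    than the limit still gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)

    # Build the index: every parameter name points at the file that holds it.
    total = len(shards)
    weight_map = {}
    for i, names in enumerate(shards, start=1):
        filename = f"{prefix}-{i:05d}-of-{total:05d}.safetensors"
        for name in names:
            weight_map[name] = filename
    return {
        "metadata": {"total_size": sum(tensor_sizes.values())},
        "weight_map": weight_map,
    }


index = shard_state_dict(
    {"a.weight": 6, "a.scale": 2, "b.weight": 6, "b.scale": 2},
    max_shard_size=8,
)
```

Note that the index carries no quantization details: auxiliary tensors such as scales are just more entries in the weight map, and the quantization config lives in config.json.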
Pre-quantization Dtype Handling
During loading, the original torch_dtype is stored as _pre_quantization_dtype in the model config. This is a torch.dtype object that is not JSON-serializable. During saving:
- save_config() handles the serialization of the config, including the quantization_config dict
- The _pre_quantization_dtype is purged or handled specially since it cannot be directly stored in JSON
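The purge step can be sketched without torch. FakeDtype is a hypothetical stand-in for torch.dtype (which json.dumps also cannot encode), and to_json_saveable is an illustrative helper showing one way the entry could be dropped before writing, not the library's actual function.

```python
import json


class FakeDtype:
    """Stand-in for torch.dtype: an arbitrary object json cannot encode."""

    def __init__(self, name):
        self.name = name


def to_json_saveable(config):
    """Return a copy of the config without the non-serializable dtype entry."""
    cleaned = dict(config)  # shallow copy so the in-memory config is untouched
    cleaned.pop("_pre_quantization_dtype", None)
    return cleaned


runtime_config = {
    "_class_name": "FluxTransformer2DModel",
    "quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": True},
    "_pre_quantization_dtype": FakeDtype("bfloat16"),
}

try:
    json.dumps(runtime_config)
    raw_is_serializable = True
except TypeError:
    raw_is_serializable = False  # the dtype object blocks serialization

saved_text = json.dumps(to_json_saveable(runtime_config), sort_keys=True)
```

Dropping the key loses nothing needed at load time in this sketch, since the quantization_config section remains intact.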
Pipeline-Level Saving
When saving a pipeline with DiffusionPipeline.save_pretrained(), each component is saved individually to its own subdirectory. The pipeline iterates over its saveable modules, determines the correct save method for each, and calls it with appropriate arguments. Quantized components are saved by their model-level save_pretrained, which handles the quantization-specific logic. The pipeline's top-level model_index.json is unaffected by quantization.
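The per-component flow described above might look roughly like this. SketchComponent and save_pipeline are hypothetical, and the model_index.json written here is simplified to a name-to-class mapping; the real file has a richer layout.

```python
import json
import os
import tempfile


class SketchComponent:
    """Hypothetical component that knows how to save its own config."""

    def __init__(self, config):
        self.config = config

    def save_pretrained(self, save_directory):
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, "config.json"), "w") as f:
            json.dump(self.config, f)


def save_pipeline(components, save_directory):
    """Save each saveable component to its own subdirectory, then the index."""
    os.makedirs(save_directory, exist_ok=True)
    index = {}
    for name, component in components.items():
        if not hasattr(component, "save_pretrained"):
            continue  # skip entries that are not saveable modules
        component.save_pretrained(os.path.join(save_directory, name))
        index[name] = type(component).__name__
    with open(os.path.join(save_directory, "model_index.json"), "w") as f:
        json.dump(index, f)
    return index


with tempfile.TemporaryDirectory() as tmp:
    pipeline_index = save_pipeline(
        {
            "transformer": SketchComponent(
                {"quantization_config": {"quant_method": "bitsandbytes"}}
            ),
            "vae": SketchComponent({}),
            "guidance_scale": 3.5,  # plain value, not a saveable module
        },
        tmp,
    )
    with open(os.path.join(tmp, "transformer", "config.json")) as f:
        transformer_config = json.load(f)
    saved_entries = sorted(os.listdir(tmp))
```

The quantization details appear only in the transformer's own config.json; the top-level index looks the same whether or not a component is quantized.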
Key Design Decisions
- Serialization gate: The is_serializable abstract property forces each backend to explicitly declare whether its quantized models can be saved. This prevents silent data corruption from saving unsupported formats.
- Standard save_pretrained API: No special save method is needed for quantized models. The same model.save_pretrained() / pipeline.save_pretrained() API works for both quantized and non-quantized models.
- Config-driven reloading: By embedding quantization_config in config.json, saved models are self-describing. Users can load them without knowing the quantization details.
- Safetensors as default: Using safetensors (safe_serialization=True) by default ensures safe, efficient, and zero-copy-capable weight storage.
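As a usage illustration of the standard API and config-driven reloading, the following sketch wraps the save and reload steps in a function. It assumes diffusers, torch, and bitsandbytes are installed, a CUDA device is available, and that `repo_id` points at a repository with a "transformer" subfolder; the function name and both arguments are hypothetical.

```python
def save_and_reload_quantized(repo_id, save_directory):
    """Quantize a transformer to 4-bit NF4, save it, and reload it from disk."""
    # Imports are local so that defining this function needs no heavy packages.
    import torch
    from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = FluxTransformer2DModel.from_pretrained(
        repo_id,
        subfolder="transformer",  # assumed repository layout
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    # Same call as for a full-precision model; safetensors is the default format.
    model.save_pretrained(save_directory)

    # No quantization_config is passed here: it is read from the saved config.json.
    return FluxTransformer2DModel.from_pretrained(save_directory)
```

The second from_pretrained call is the point of the design: the caller does not need to know which backend or bit-width produced the checkpoint.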
Related Pages
Implemented By
- Huggingface_Diffusers_Save_Pretrained_Quantized - Implementation of the save flow
- Huggingface_Diffusers_Quantized_Model_Loading - The loading counterpart that reads saved quantized models
- Huggingface_Diffusers_Quantization_Configuration - Config objects that are serialized into config.json
- Huggingface_Diffusers_Quantized_Inference - Running inference before saving
Source References
- src/diffusers/models/modeling_utils.py:L667-L820 - ModelMixin.save_pretrained with quantization checks
- src/diffusers/pipelines/pipeline_utils.py:L240-L371 - DiffusionPipeline.save_pretrained
- src/diffusers/quantizers/base.py:L234-L236 - is_serializable abstract property
- src/diffusers/quantizers/quantization_config.py:L125-L130 - to_dict serialization