Principle:Huggingface Diffusers Checkpoint Format Identification
| Property | Value |
|---|---|
| Principle Name | Checkpoint Format Identification |
| Overview | Identifying the architecture and format of an original model checkpoint through key-based detection and tensor shape analysis |
| Domains | Model Conversion, Checkpoint Analysis |
| Related Implementation | Huggingface_Diffusers_Infer_Model_Type |
| Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/loaders/single_file_utils.py:L583-L817) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Before converting a checkpoint to the Diffusers format, the system must identify which architecture the checkpoint represents. This is a non-trivial problem because original checkpoints from different model families (Stable Diffusion v1/v2/XL, Flux, CogVideoX, Wan, HunyuanVideo, etc.) use different key naming conventions, different weight shapes, and different organizational structures.
The identification process uses a key fingerprinting approach: specific weight keys and their tensor shapes serve as unique identifiers for each model type.
Theoretical Basis
Key-Based Architecture Detection
Each model architecture has distinctive weight key names that act as fingerprints. The detection strategy uses a priority-ordered cascade of checks:
- Inpainting detection: Check if the UNet input convolution has 9 input channels (4 latent + 4 masked latent + 1 mask)
- Version detection: Check for SD v2-specific keys and 1024-dimensional projections
- Architecture-specific keys: Each model family has unique layer names:
  - Flux: `double_blocks.*.img_attn.qkv.weight` / `single_blocks.*.linear1.weight`
  - Wan: `patch_embedding.weight` with specific output dimensions (1536 for 1.3B, 5120 for 14B)
  - HunyuanVideo: Architecture-specific key patterns
  - CogVideoX: Distinguished by its rotary embedding keys
- Shape-based disambiguation: When keys overlap, tensor shapes differentiate variants:
  - Wan T2V vs I2V: `patch_embedding.weight.shape[1] == 16` (T2V) vs other (I2V)
  - Flux dev vs schnell: Presence of `guidance_in` keys
  - SD3 vs SD3.5: `pos_embed` shape (36864 vs 147456)
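The cascade described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual `infer_diffusers_model_type` implementation: `FakeTensor` and `infer_model_type_sketch` are hypothetical names, the UNet input key `model.diffusion_model.input_blocks.0.0.weight` is assumed to be the input convolution the text refers to, and the returned strings other than "wan-t2v-14B", "flux-dev", and "v1" are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a real tensor; only the shape matters here."""
    shape: Tuple[int, ...]


def infer_model_type_sketch(checkpoint: Dict[str, FakeTensor]) -> str:
    """Toy priority-ordered cascade; checks run from most to least specific."""
    # 1. Inpainting: the first UNet convolution takes 9 input channels.
    #    (Key name is an assumption about the original SD layout.)
    unet_in = checkpoint.get("model.diffusion_model.input_blocks.0.0.weight")
    if unet_in is not None and unet_in.shape[1] == 9:
        return "inpainting"

    # 2. Flux: a distinctive double-block key; guidance_in keys mark dev.
    if "double_blocks.0.img_attn.qkv.weight" in checkpoint:
        has_guidance = any(k.startswith("guidance_in.") for k in checkpoint)
        return "flux-dev" if has_guidance else "flux-schnell"

    # 3. Wan: shape[0] selects the model size, shape[1] selects T2V vs I2V.
    patch = checkpoint.get("patch_embedding.weight")
    if patch is not None:
        size = {1536: "1.3B", 5120: "14B"}.get(patch.shape[0])
        if size is not None:
            task = "t2v" if patch.shape[1] == 16 else "i2v"
            return f"wan-{task}-{size}"

    # 4. Nothing matched: fall back to Stable Diffusion v1.x.
    return "v1"
```

Ordering matters in a cascade like this: the most specific checks come first so that a generic fallback never shadows a more precise match.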
Checkpoint Structure Patterns
Original checkpoints may have keys with or without a common prefix:
- Bare keys: `patch_embedding.weight` (exported directly)
- Prefixed keys: `model.diffusion_model.patch_embedding.weight` (from training frameworks)
The detection must handle both patterns, often by checking for both variants of the same key.
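One way to handle both patterns is a small lookup helper that tries the bare key first and then the prefixed variant. The sketch below is illustrative only: `find_key` and `COMMON_PREFIX` are hypothetical names, and the library itself may check both variants inline instead of through a helper.

```python
from typing import Any, Mapping, Optional

# Prefix taken from the prefixed-key example above.
COMMON_PREFIX = "model.diffusion_model."


def find_key(
    checkpoint: Mapping[str, Any],
    bare_key: str,
    prefix: str = COMMON_PREFIX,
) -> Optional[str]:
    """Return whichever variant of bare_key exists in the checkpoint, else None."""
    for candidate in (bare_key, prefix + bare_key):
        if candidate in checkpoint:
            return candidate
    return None
```

Returning the matched key, not a boolean, lets the caller go on to read the tensor shape from the same entry without repeating the lookup.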
Model Type to Config Mapping
Once the model type string is determined (e.g., "wan-t2v-14B", "flux-dev"), it maps to a default pipeline configuration in DIFFUSERS_DEFAULT_PIPELINE_PATHS. This configuration provides:
- The pretrained model name to fetch the Diffusers config from
- The subfolder structure for each component
- The pipeline class to use
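The lookup step can be pictured as a plain dictionary access. The sketch below is a hypothetical stand-in for `DIFFUSERS_DEFAULT_PIPELINE_PATHS`: the repository ids are placeholders, not real model names, and the `pretrained_model_name_or_path` field name and the `default_config_for` helper are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical stand-in table; every repo id below is a placeholder.
EXAMPLE_PIPELINE_PATHS: Dict[str, Dict[str, str]] = {
    "v1": {"pretrained_model_name_or_path": "example-org/sd-v1-placeholder"},
    "flux-dev": {"pretrained_model_name_or_path": "example-org/flux-dev-placeholder"},
    "wan-t2v-14B": {"pretrained_model_name_or_path": "example-org/wan-t2v-14b-placeholder"},
}


def default_config_for(model_type: str) -> Dict[str, str]:
    """Map a detected model type string to its default config entry."""
    try:
        return EXAMPLE_PIPELINE_PATHS[model_type]
    except KeyError:
        # Fail loudly so an unrecognized type is not silently converted.
        raise ValueError(
            f"No default config registered for model type {model_type!r}"
        ) from None
```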
Usage
Format identification is the first step in any checkpoint conversion workflow:
- Load the checkpoint file (safetensors or ckpt) into a dictionary
- Pass the dictionary to `infer_diffusers_model_type(checkpoint)`
- The returned model type string determines which conversion function and config to use
- If the model type cannot be determined, it defaults to `"v1"` (Stable Diffusion v1.x)
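The first two steps might look like the sketch below. It assumes that `safetensors` and `diffusers` are installed and that `infer_diffusers_model_type` remains importable from `diffusers.loaders.single_file_utils`, which is an internal module whose location may change between releases. `identify_checkpoint` and its `load_fn` and `infer_fn` parameters are hypothetical names, added so the loader and detector can be swapped out.

```python
from typing import Any, Callable, Dict, Optional


def identify_checkpoint(
    path: str,
    load_fn: Optional[Callable[[str], Dict[str, Any]]] = None,
    infer_fn: Optional[Callable[[Dict[str, Any]], str]] = None,
) -> str:
    """Load a checkpoint into a dict and return its detected model type."""
    if load_fn is None:
        # Reads .safetensors files only; a .ckpt file needs a different loader.
        from safetensors.torch import load_file as load_fn
    if infer_fn is None:
        # Internal diffusers module; the import path is an assumption.
        from diffusers.loaders.single_file_utils import (
            infer_diffusers_model_type as infer_fn,
        )

    checkpoint = load_fn(path)   # step 1: file -> {key: tensor}
    return infer_fn(checkpoint)  # step 2: dict -> model type string
```

Injecting the loader and the detector keeps the function usable in tests without large model files, for example `identify_checkpoint("dummy", load_fn=lambda p: {}, infer_fn=lambda c: "v1")`.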
Related Pages
- Huggingface_Diffusers_Infer_Model_Type (implements this principle) - Concrete detection function
- Huggingface_Diffusers_Conversion_Script_Selection (next step) - Using the model type to select a conversion script
- Huggingface_Diffusers_Weight_Mapping (next step) - Actual key remapping after identification
- Huggingface_Diffusers_Single_File_Loading (orchestrator) - The from_single_file flow that uses identification