Principle: Zai org CogVideo SAT Weight Export
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | CogVideo |
| Domains | Model_Conversion, Deployment |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for converting trained model weights from SAT format to HuggingFace Diffusers format for cross-framework compatibility.
Description
After training with the SAT framework, model weights must be converted to HuggingFace Diffusers format for use with the standard Diffusers inference pipeline. This conversion involves remapping layer names between SAT and Diffusers naming conventions, reshaping tensors where the two frameworks use different layouts, and packaging the result into HF-compatible checkpoint directories or safetensors files.
Full Model Conversion
Full model conversion (via `convert_weight_sat2hf.py`) transforms the complete transformer and, optionally, the VAE from SAT format into a Diffusers pipeline:
Transformer Conversion
The transformer conversion applies two stages of key remapping:
Stage 1: String replacement using `TRANSFORMER_KEYS_RENAME_DICT`:
- `transformer.final_layernorm` becomes `norm_final`
- `transformer` becomes `transformer_blocks`
- `attention` becomes `attn1`
- `mlp` becomes `ff.net`
- `dense_h_to_4h` becomes `0.proj`
- `dense_4h_to_h` becomes `2`
- `layers` is removed (blocks are indexed directly)
- `dense` becomes `to_out.0`
- `input_layernorm` becomes `norm1.norm`
- `post_attn1_layernorm` becomes `norm2.norm`
- Time and OFS embeddings, patch embeddings, and the final layer have their own mappings
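The stage-1 pass can be sketched as ordered substring replacement over every state-dict key. This is a simplified illustration, not the script's exact code; the dict below is a subset of the real `TRANSFORMER_KEYS_RENAME_DICT`, and the sample key is hypothetical:

```python
# Ordered substring replacement: order matters, e.g. the specific
# "transformer.final_layernorm" rule must run before the bare "transformer" rule.
RENAME_DICT = {
    "transformer.final_layernorm": "norm_final",
    "transformer": "transformer_blocks",
    "attention": "attn1",
    "mlp": "ff.net",
    "dense_h_to_4h": "0.proj",
    "dense_4h_to_h": "2",
    ".layers": "",  # drop "layers" so blocks are indexed directly
    "dense": "to_out.0",
    "input_layernorm": "norm1.norm",
    "post_attn1_layernorm": "norm2.norm",
}

def rename_key(key: str) -> str:
    for old, new in RENAME_DICT.items():
        key = key.replace(old, new)
    return key

# Hypothetical SAT-style key, for illustration:
print(rename_key("transformer.layers.0.attention.dense.weight"))
# transformer_blocks.0.attn1.to_out.0.weight
```

Because plain `str.replace` is used, rule ordering is the only thing preventing the generic rules from clobbering the more specific names.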
Stage 2: Special key handlers using `TRANSFORMER_SPECIAL_KEYS_REMAP`:
- `query_key_value`: SAT fuses Q, K, and V into a single tensor. The handler splits this into three separate tensors (`to_q`, `to_k`, `to_v`) using `torch.chunk(chunks=3)`.
- `adaln_layer.adaLN_modulations`: SAT stores all 12 AdaLN modulation parameters in a single tensor. The handler chunks this into norm1 and norm2 parameters (6 each), rearranging to match the Diffusers layout.
- `query_layernorm_list` / `key_layernorm_list`: QK normalization layers are remapped to `attn1.norm_q` and `attn1.norm_k`.
- `embed_tokens`, `freqs_sin`, `freqs_cos`, `position_embedding`: these keys are removed because Diffusers computes them internally.
The converted state dict is loaded into a fresh CogVideoXTransformer3DModel instance with strict key matching.
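The two tensor-reshaping handlers above can be sketched as follows. The sizes are made up, and the regrouping order for the AdaLN chunks is illustrative; the script's exact rearrangement matches the Diffusers layout:

```python
import torch

hidden_dim = 64  # made-up size; the real models use much larger hidden sizes

# query_key_value: SAT fuses Q, K, V row-wise, so the weight has shape
# [3 * hidden_dim, hidden_dim]; torch.chunk along dim 0 recovers the three
# separate Diffusers projections.
fused_qkv = torch.randn(3 * hidden_dim, hidden_dim)
to_q, to_k, to_v = torch.chunk(fused_qkv, chunks=3, dim=0)

# adaLN_modulations: all 12 modulation vectors live in one tensor; chunking
# into 12 and regrouping (6 for norm1, 6 for norm2) matches the split the
# conversion performs. The first-6/last-6 grouping here is an assumption.
adaln = torch.randn(12 * hidden_dim)
chunks = torch.chunk(adaln, chunks=12, dim=0)
norm1_params = torch.cat(chunks[:6])   # six chunks -> norm1
norm2_params = torch.cat(chunks[6:])   # six chunks -> norm2
```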
VAE Conversion
The VAE conversion applies similar two-stage remapping:
- Block, downsampling, and upsampling layer names are translated between SAT and Diffusers conventions.
- Up-block layer indices are inverted (SAT stores them in encoder order; Diffusers uses reversed order).
- Loss-related keys are removed.
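The up-block index inversion can be illustrated with a small helper. The key pattern and block count are hypothetical (the real VAE keys are longer, and this assumes renaming has already produced Diffusers-style `up_blocks.i` names):

```python
import re

NUM_UP_BLOCKS = 4  # assumed block count, for illustration only

def invert_up_block_index(key: str, num_up_blocks: int = NUM_UP_BLOCKS) -> str:
    """SAT stores up-blocks in encoder order; Diffusers expects the reverse,
    so index i maps to num_up_blocks - 1 - i."""
    match = re.search(r"up_blocks\.(\d+)", key)
    if match is None:
        return key  # not an up-block key; leave untouched
    i = int(match.group(1))
    return key.replace(f"up_blocks.{i}", f"up_blocks.{num_up_blocks - 1 - i}", 1)

print(invert_up_block_index("decoder.up_blocks.0.resnets.0.conv1.weight"))
# decoder.up_blocks.3.resnets.0.conv1.weight
```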
Pipeline Assembly
After converting individual components, the script assembles a complete Diffusers pipeline:
- Converted transformer (or original if not converting).
- Converted VAE (or original if not converting; VAE is kept in float32 for quality).
- T5-XXL tokenizer and text encoder loaded from `google/t5-v1_1-xxl`.
- `CogVideoXDDIMScheduler` with model-specific parameters (`snr_shift_scale`, v-prediction, zero-SNR).
The pipeline is saved via pipe.save_pretrained with safe serialization and a 5GB shard size limit.
LoRA Weight Export
LoRA weight export (via `export_sat_lora_weight.py`) is a lighter conversion that extracts only the LoRA adapter weights:
- Loads the SAT checkpoint and extracts the state dict.
- Iterates over all keys, filtering for LoRA-specific parameter names containing `matrix_A` or `matrix_B`.
- Remaps SAT LoRA naming to Diffusers PEFT-compatible naming:
  - `attention.query_key_value.matrix_A.0` becomes `attn1.to_q.lora_A.weight`
  - `attention.query_key_value.matrix_B.0` becomes `attn1.to_q.lora_B.weight`
  - (and similarly for the K, V, and output projection matrices)
  - `layers` is replaced with `transformer_blocks`
- Validates that exactly 240 LoRA parameters are extracted (30 layers × 8 matrices per layer for the 2B model: Q, K, V, and output projections, each with an A and a B matrix).
- Saves via `LoraBaseMixin.write_lora_layers` with safe serialization, producing `pytorch_lora_weights.safetensors`.
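The filter-and-remap step can be sketched in a few lines. This is a simplified illustration with made-up key paths; the real script also renames the surrounding module prefix and handles the non-fused projections:

```python
# The suffix ".0" on matrix_A / matrix_B selects the Q slice of the fused QKV
# projection; ".1" and ".2" map to K and V respectively.
QKV_INDEX_TO_PROJ = {"0": "to_q", "1": "to_k", "2": "to_v"}

def remap_lora_key(key):
    """Return the Diffusers/PEFT name for a SAT LoRA key,
    or None if the key is not a LoRA parameter."""
    if "matrix_A" not in key and "matrix_B" not in key:
        return None
    key = key.replace("layers", "transformer_blocks")
    for idx, proj in QKV_INDEX_TO_PROJ.items():
        for mat in ("A", "B"):
            key = key.replace(
                f"attention.query_key_value.matrix_{mat}.{idx}",
                f"attn1.{proj}.lora_{mat}.weight",
            )
    return key

print(remap_lora_key("layers.0.attention.query_key_value.matrix_A.0"))
# transformer_blocks.0.attn1.to_q.lora_A.weight
```

The script's 240-parameter validation then reduces to a simple length check over the non-`None` remapped keys.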
Model Version Differences
The conversion scripts support different CogVideoX versions through version-specific parameters:
| Parameter | CogVideoX-2B | CogVideoX-5B | CogVideoX-1.5 |
|---|---|---|---|
| num_layers | 30 | 42 | 42 |
| num_attention_heads | 30 | 48 | 48 |
| use_rotary_positional_embeddings | False | True | True |
| scaling_factor (VAE) | 1.15258426 | 0.7 | 0.7 |
| snr_shift_scale | 3.0 | 1.0 | 1.0 |
| in_channels | 16 (T2V) / 32 (I2V) | 16 (T2V) / 32 (I2V) | 16 (T2V) / 32 (I2V) |
| version flag | "1.0" | "1.0" | "1.5" |
| patch_size_t | None | None | 2 |
Usage
Use after completing SAT-based training to enable inference via the Diffusers pipeline. This conversion is required for deploying SAT-trained models in production or sharing them on the HuggingFace Hub.
- LoRA models: use `export_sat_lora_weight.py` to extract adapter weights as `pytorch_lora_weights.safetensors`. These can be loaded via `pipe.load_lora_weights()` in Diffusers.
- Full fine-tuned models: use `convert_weight_sat2hf.py` to convert the complete transformer (and optionally the VAE) into a Diffusers pipeline directory.
Theoretical Basis
Cross-Framework Naming Conventions
SAT and Diffusers use different naming conventions rooted in their respective design philosophies. SAT follows the GPT-style naming from Megatron-LM (e.g., query_key_value for fused QKV, dense_h_to_4h for MLP up-projection), while Diffusers follows the convention established by the original Stable Diffusion codebase (e.g., to_q/to_k/to_v for separate attention projections, ff.net.0.proj for MLP). The conversion scripts bridge these conventions through deterministic string replacement and tensor splitting.
QKV Fusion and Splitting
SAT stores query, key, and value projections as a single fused tensor [3 * hidden_dim, hidden_dim] for computational efficiency (single matrix multiplication instead of three). Diffusers stores them as three separate tensors. The conversion splits the fused tensor along dimension 0 using torch.chunk(chunks=3).
LoRA Weight Structure
LoRA adapters add low-rank decomposition to existing linear layers: output = W @ x + (B @ A) @ x where A has shape [r, in_features] and B has shape [out_features, r]. SAT stores these as matrix_A and matrix_B parameters, while the Diffusers PEFT format stores them as lora_A.weight and lora_B.weight. For fused QKV layers, SAT uses indexed suffixes (matrix_A.0, matrix_A.1, matrix_A.2) corresponding to Q, K, V respectively.
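The shapes in this decomposition can be checked with a tiny example. The sizes are arbitrary, and the zero initialization of B follows the standard LoRA recipe rather than anything specific to SAT:

```python
import torch

# Arbitrary illustrative sizes: a rank-8 adapter on a 64 -> 128 linear layer.
in_features, out_features, r = 64, 128, 8
W = torch.randn(out_features, in_features)  # frozen base weight
A = torch.randn(r, in_features)             # matrix_A / lora_A.weight
B = torch.zeros(out_features, r)            # matrix_B / lora_B.weight

x = torch.randn(in_features)
output = W @ x + (B @ A) @ x                # low-rank update added to the base

# B @ A reconstructs a full-rank-shaped update from 2 * r * dim parameters.
assert (B @ A).shape == (out_features, in_features)
# With B zero-initialized, the adapter contributes nothing at the start:
assert torch.allclose(output, W @ x)
```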