
Principle:Zai org CogVideo SAT Weight Export

From Leeroopedia


Metadata

Field              Value
Page Type          Principle
Knowledge Sources  CogVideo
Domains            Model_Conversion, Deployment
Last Updated       2026-02-10 00:00 GMT

Overview

Technique for converting trained model weights from SAT format to HuggingFace Diffusers format for cross-framework compatibility.

Description

After training with the SAT framework, model weights must be converted to HuggingFace Diffusers format for use with the standard Diffusers inference pipeline. This conversion involves remapping layer names between SAT and Diffusers naming conventions, reshaping tensors where the two frameworks use different layouts, and packaging the result into HF-compatible checkpoint directories or safetensors files.

Full Model Conversion

Full model conversion (via convert_weight_sat2hf.py) transforms the complete transformer and optionally the VAE from SAT format to a Diffusers pipeline:

Transformer Conversion

The transformer conversion applies two stages of key remapping:

Stage 1: String replacement using TRANSFORMER_KEYS_RENAME_DICT:

  • transformer.final_layernorm becomes norm_final
  • transformer becomes transformer_blocks
  • attention becomes attn1
  • mlp becomes ff.net
  • dense_h_to_4h becomes 0.proj
  • dense_4h_to_h becomes 2
  • .layers is removed (direct block indexing)
  • dense becomes to_out.0
  • input_layernorm becomes norm1.norm
  • post_attn1_layernorm becomes norm2.norm
  • Time and OFS embeddings, patch embeddings, and final layer mappings
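The stage-1 remapping above can be sketched as ordered string replacement over every state-dict key. The dictionary below is an illustrative subset reconstructed from the bullets, not the script's full TRANSFORMER_KEYS_RENAME_DICT:

```python
# Illustrative subset of the stage-1 rename table. Entries are applied in
# insertion order, so longer, more specific patterns must precede their
# substrings (e.g. "transformer.final_layernorm" before "transformer",
# and "dense_h_to_4h" before "dense").
TRANSFORMER_KEYS_RENAME_DICT = {
    "transformer.final_layernorm": "norm_final",
    "transformer": "transformer_blocks",
    "attention": "attn1",
    "mlp": "ff.net",
    "dense_h_to_4h": "0.proj",
    "dense_4h_to_h": "2",
    ".layers": "",  # drop the extra nesting level for direct block indexing
    "dense": "to_out.0",
    "input_layernorm": "norm1.norm",
    "post_attn1_layernorm": "norm2.norm",  # "attention" -> "attn1" has already run
}

def remap_key(key: str) -> str:
    """Apply each replacement in order to a single SAT state-dict key."""
    for old, new in TRANSFORMER_KEYS_RENAME_DICT.items():
        key = key.replace(old, new)
    return key
```

For example, `transformer.layers.0.attention.dense.weight` becomes `transformer_blocks.0.attn1.to_out.0.weight` after the chain of replacements.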

Stage 2: Special key handlers using TRANSFORMER_SPECIAL_KEYS_REMAP:

  • query_key_value: SAT fuses Q, K, V into a single tensor. The handler splits this into three separate tensors (to_q, to_k, to_v) using torch.chunk(chunks=3).
  • adaln_layer.adaLN_modulations: SAT stores all 12 AdaLN modulation parameters in a single tensor. The handler chunks this into norm1 and norm2 parameters (6 each), rearranging to match the Diffusers layout.
  • query/key_layernorm_list: QK normalization layers are remapped to attn1.norm_q and attn1.norm_k.
  • embed_tokens, freqs_sin, freqs_cos, position_embedding: These keys are removed as Diffusers computes them internally.
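The first special handler above can be sketched as follows; the function name and in-place state-dict mutation are assumptions, but the split itself follows the torch.chunk behavior the page describes:

```python
import torch

def handle_fused_qkv(state_dict: dict, key: str) -> None:
    # Hypothetical sketch of the query_key_value handler: SAT stores Q, K, V
    # fused along dim 0 ([3*H, H] weight or [3*H] bias), so the Diffusers
    # to_q/to_k/to_v tensors are the three equal chunks of the fused tensor.
    fused = state_dict.pop(key)
    q, k, v = torch.chunk(fused, chunks=3, dim=0)
    for name, tensor in (("to_q", q), ("to_k", k), ("to_v", v)):
        state_dict[key.replace("query_key_value", name)] = tensor
```

After running this on a key such as `attn1.query_key_value.weight`, the state dict holds `attn1.to_q.weight`, `attn1.to_k.weight`, and `attn1.to_v.weight`, each of shape [H, H].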

The converted state dict is loaded into a fresh CogVideoXTransformer3DModel instance with strict key matching.

VAE Conversion

The VAE conversion applies similar two-stage remapping:

  • Block, downsampling, and upsampling layer names are translated between SAT and Diffusers conventions.
  • Up-block layer indices are inverted (SAT stores them in encoder order; Diffusers uses reversed order).
  • Loss-related keys are removed.
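The up-block index inversion can be sketched as a regex rewrite; the key pattern and block count here are hypothetical, chosen only to illustrate the encoder-order to decoder-order flip:

```python
import re

def invert_up_block_index(key: str, num_up_blocks: int = 4) -> str:
    # SAT enumerates up-blocks in encoder order; Diffusers reverses them,
    # so index i maps to (num_up_blocks - 1 - i).
    return re.sub(
        r"up_blocks\.(\d+)",
        lambda m: f"up_blocks.{num_up_blocks - 1 - int(m.group(1))}",
        key,
    )
```

With four up-blocks, index 0 becomes 3 and index 3 becomes 0, leaving the rest of the key untouched.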

Pipeline Assembly

After converting individual components, the script assembles a complete Diffusers pipeline:

  1. Converted transformer (or original if not converting).
  2. Converted VAE (or original if not converting; VAE is kept in float32 for quality).
  3. T5-XXL tokenizer and text encoder loaded from google/t5-v1_1-xxl.
  4. CogVideoXDDIMScheduler with model-specific parameters (snr_shift_scale, v_prediction, ZeroSNR).

The pipeline is saved via pipe.save_pretrained with safe serialization and a 5GB shard size limit.

LoRA Weight Export

LoRA weight export (via export_sat_lora_weight.py) is a lighter conversion that extracts only the LoRA adapter weights:

  1. Loads the SAT checkpoint and extracts the state dict.
  2. Iterates over all keys, filtering for LoRA-specific parameter names containing matrix_A or matrix_B.
  3. Remaps SAT LoRA naming to Diffusers PEFT-compatible naming:
    • attention.query_key_value.matrix_A.0 becomes attn1.to_q.lora_A.weight
    • attention.query_key_value.matrix_B.0 becomes attn1.to_q.lora_B.weight
    • (And similarly for K, V, and output projection matrices)
    • layers is replaced with transformer_blocks
  4. Validates that exactly 240 LoRA parameters are extracted for the 2B model: 30 layers × 8 matrices per layer (an A and a B matrix for each of the Q, K, V, and output projections).
  5. Saves via LoraBaseMixin.write_lora_layers with safe serialization, producing pytorch_lora_weights.safetensors.
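Steps 2 and 3 above can be sketched as a key filter plus a renaming pass; the function names are assumptions, and the mapping is reconstructed from the examples listed:

```python
def is_lora_key(key: str) -> bool:
    # Step 2: keep only LoRA adapter parameters.
    return "matrix_A" in key or "matrix_B" in key

def remap_sat_lora_key(key: str) -> str:
    # Step 3: SAT fused-QKV LoRA matrices carry indexed suffixes
    # (.0/.1/.2 for Q/K/V); each maps to a separate PEFT projection.
    key = key.replace("layers", "transformer_blocks")
    qkv_targets = {"0": "to_q", "1": "to_k", "2": "to_v"}
    for idx, proj in qkv_targets.items():
        for mat in ("A", "B"):
            key = key.replace(
                f"attention.query_key_value.matrix_{mat}.{idx}",
                f"attn1.{proj}.lora_{mat}.weight",
            )
    return key
```

For instance, `layers.0.attention.query_key_value.matrix_A.0` maps to `transformer_blocks.0.attn1.to_q.lora_A.weight`.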

Model Version Differences

The conversion scripts support different CogVideoX versions through version-specific parameters:

Parameter                         CogVideoX-2B         CogVideoX-5B         CogVideoX-1.5
num_layers                        30                   42                   42
num_attention_heads               30                   48                   48
use_rotary_positional_embeddings  False                True                 True
scaling_factor (VAE)              1.15258426           0.7                  0.7
snr_shift_scale                   3.0                  1.0                  1.0
in_channels                       16 (T2V) / 32 (I2V)  16 (T2V) / 32 (I2V)  16 (T2V) / 32 (I2V)
version flag                      "1.0"                "1.0"                "1.5"
patch_size_t                      None                 None                 2
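As a sketch, the version-specific selection can be expressed as a lookup table; the dictionary name and key spellings are hypothetical, with values transcribed from the table above:

```python
# Hypothetical per-version parameter table for the conversion scripts.
MODEL_CONFIGS = {
    "CogVideoX-2B": dict(
        num_layers=30, num_attention_heads=30,
        use_rotary_positional_embeddings=False,
        vae_scaling_factor=1.15258426, snr_shift_scale=3.0,
        version="1.0", patch_size_t=None,
    ),
    "CogVideoX-5B": dict(
        num_layers=42, num_attention_heads=48,
        use_rotary_positional_embeddings=True,
        vae_scaling_factor=0.7, snr_shift_scale=1.0,
        version="1.0", patch_size_t=None,
    ),
    "CogVideoX-1.5": dict(
        num_layers=42, num_attention_heads=48,
        use_rotary_positional_embeddings=True,
        vae_scaling_factor=0.7, snr_shift_scale=1.0,
        version="1.5", patch_size_t=2,
    ),
}
```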

Usage

Use after completing SAT-based training to enable inference via the Diffusers pipeline. This conversion is required for deploying SAT-trained models in production or sharing them on the HuggingFace Hub.

  • LoRA models: Use export_sat_lora_weight.py to extract adapter weights as pytorch_lora_weights.safetensors. These can be loaded via pipe.load_lora_weights() in Diffusers.
  • Full fine-tuned models: Use convert_weight_sat2hf.py to convert the complete transformer (and optionally VAE) into a Diffusers pipeline directory.

Theoretical Basis

Cross-Framework Naming Conventions

SAT and Diffusers use different naming conventions rooted in their respective design philosophies. SAT follows the GPT-style naming from Megatron-LM (e.g., query_key_value for fused QKV, dense_h_to_4h for MLP up-projection), while Diffusers follows the convention established by the original Stable Diffusion codebase (e.g., to_q/to_k/to_v for separate attention projections, ff.net.0.proj for MLP). The conversion scripts bridge these conventions through deterministic string replacement and tensor splitting.

QKV Fusion and Splitting

SAT stores query, key, and value projections as a single fused tensor [3 * hidden_dim, hidden_dim] for computational efficiency (single matrix multiplication instead of three). Diffusers stores them as three separate tensors. The conversion splits the fused tensor along dimension 0 using torch.chunk(chunks=3).
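The equivalence can be checked numerically: chunking the fused projection's output matches applying the three separate projections, because rows [0, H), [H, 2H), and [2H, 3H) of the fused weight are exactly W_q, W_k, and W_v. A small self-contained check (toy sizes, no model weights involved):

```python
import torch

H = 8
w_q, w_k, w_v = (torch.randn(H, H) for _ in range(3))
w_fused = torch.cat([w_q, w_k, w_v], dim=0)  # [3H, H], SAT layout

x = torch.randn(4, H)  # 4 tokens
# One fused matmul, then split the output into Q, K, V.
q, k, v = torch.chunk(x @ w_fused.T, chunks=3, dim=-1)

assert torch.allclose(q, x @ w_q.T, atol=1e-5)
assert torch.allclose(k, x @ w_k.T, atol=1e-5)
assert torch.allclose(v, x @ w_v.T, atol=1e-5)
```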

LoRA Weight Structure

LoRA adapters add low-rank decomposition to existing linear layers: output = W @ x + (B @ A) @ x where A has shape [r, in_features] and B has shape [out_features, r]. SAT stores these as matrix_A and matrix_B parameters, while the Diffusers PEFT format stores them as lora_A.weight and lora_B.weight. For fused QKV layers, SAT uses indexed suffixes (matrix_A.0, matrix_A.1, matrix_A.2) corresponding to Q, K, V respectively.
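The shapes above can be verified with a toy forward pass; the sizes are arbitrary, and the zero initialization of B is the standard LoRA convention rather than something this page states:

```python
import torch

in_f, out_f, r = 16, 16, 4
W = torch.randn(out_f, in_f)   # frozen base weight
A = torch.randn(r, in_f)       # stored as matrix_A / lora_A.weight
B = torch.zeros(out_f, r)      # stored as matrix_B / lora_B.weight
x = torch.randn(in_f)

# output = W @ x + (B @ A) @ x; the low-rank update never materializes
# a full [out_f, in_f] matrix when computed as B @ (A @ x).
out = W @ x + B @ (A @ x)

# With B initialized to zero, the adapter starts as a no-op.
assert torch.allclose(out, W @ x)
```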
