
Implementation:Zai org CogVideo SAT Convert Weight

From Leeroopedia


Metadata

Page Type: Implementation (API Doc)
Knowledge Sources: CogVideo
Domains: Model_Conversion, Deployment
Last Updated: 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by the CogVideo tools module, for converting SAT checkpoint weights to the HuggingFace Diffusers format. It supports both full transformer/VAE conversion and LoRA adapter weight export.

Description

The CogVideo repository provides two conversion scripts:

convert_weight_sat2hf.py

Converts the complete SAT transformer and/or VAE to a Diffusers-compatible pipeline directory. The conversion process:

  1. Loads the SAT checkpoint from disk using torch.load with memory mapping.
  2. Extracts the model state dict (handling nested model, module, or state_dict keys).
  3. Applies two stages of key remapping: string replacement via TRANSFORMER_KEYS_RENAME_DICT, followed by special handlers via TRANSFORMER_SPECIAL_KEYS_REMAP (QKV splitting, AdaLN chunking, layernorm remapping, unused key removal).
  4. Constructs a fresh CogVideoXTransformer3DModel with version-appropriate parameters and loads the converted state dict with strict matching.
  5. Optionally converts the VAE using similar key remapping.
  6. Assembles a complete CogVideoXPipeline (or CogVideoXImageToVideoPipeline for I2V) with the converted components, a T5-XXL tokenizer/encoder, and a CogVideoXDDIMScheduler.
  7. Saves the pipeline with safe serialization and a 5 GB shard size limit.
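The two-stage key remapping in step 3 can be sketched as follows. The dict entries and the handler shown here are illustrative stand-ins for the much larger TRANSFORMER_KEYS_RENAME_DICT and TRANSFORMER_SPECIAL_KEYS_REMAP tables in tools/convert_weight_sat2hf.py, which also include handlers for QKV splitting and AdaLN chunking:

```python
# Hypothetical miniature rename tables; the real tables in
# tools/convert_weight_sat2hf.py cover many more key patterns.
TRANSFORMER_KEYS_RENAME_DICT = {
    "transformer.final_layernorm": "norm_final",
    "mixins.pos_embed": "patch_embed",
}

def remove_unused(state_dict, key):
    # Special handler: drop keys with no Diffusers counterpart.
    state_dict.pop(key, None)

TRANSFORMER_SPECIAL_KEYS_REMAP = {
    "rotary_pos_emb": remove_unused,
}

def convert_state_dict(sat_state_dict):
    """Two-stage remap: substring renames, then special handlers."""
    new_sd = {}
    # Stage 1: substring replacement on every key.
    for key, value in sat_state_dict.items():
        new_key = key
        for old, new in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(old, new)
        new_sd[new_key] = value
    # Stage 2: special handlers mutate the dict in place for
    # matching keys (e.g. QKV splitting, removal of unused keys).
    for key in list(new_sd.keys()):
        for pattern, handler in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
            if pattern in key:
                handler(new_sd, key)
    return new_sd
```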

export_sat_lora_weight.py

Exports only the LoRA adapter weights from a SAT checkpoint into PEFT-compatible safetensors format:

  1. Loads the SAT checkpoint and extracts the state dict.
  2. Filters keys for LoRA parameters (matrix_A and matrix_B) using the LORA_KEYS_RENAME mapping.
  3. Remaps SAT LoRA naming to Diffusers PEFT naming (e.g., matrix_A.0 to lora_A.weight for the Q projection).
  4. Validates exactly 240 LoRA parameters (for the CogVideoX-2B model with 30 layers).
  5. Writes the LoRA state dict using LoraBaseMixin.write_lora_layers with safe serialization.
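The filter-and-rename logic in steps 2–3 can be sketched as below. The two-entry mapping and the key layout are hypothetical simplifications; the real LORA_KEYS_RENAME table in tools/export_sat_lora_weight.py covers the Q/K/V/output projections across every transformer block:

```python
# Hypothetical subset of the real LORA_KEYS_RENAME mapping.
LORA_KEYS_RENAME = {
    "matrix_A.0": "to_q.lora_A.weight",
    "matrix_B.0": "to_q.lora_B.weight",
}

def export_lora_state_dict(sat_state_dict):
    """Keep only LoRA parameters and rename their SAT suffixes
    (matrix_A/matrix_B) into PEFT-style keys."""
    lora_sd = {}
    for key, value in sat_state_dict.items():
        for sat_suffix, peft_suffix in LORA_KEYS_RENAME.items():
            if key.endswith(sat_suffix):
                prefix = key[: -len(sat_suffix)]
                lora_sd[prefix + peft_suffix] = value
    return lora_sd
```

Non-LoRA weights are silently dropped; only keys ending in a known LoRA suffix survive the filter.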

Usage

Use these scripts after SAT training completes to convert checkpoints for Diffusers-based inference or HuggingFace Hub deployment.

Code Reference

Source Location

  • tools/convert_weight_sat2hf.py:L153-191 (convert_transformer)
  • tools/convert_weight_sat2hf.py:L194-215 (convert_vae)
  • tools/convert_weight_sat2hf.py:L326-403 (main block)
  • tools/export_sat_lora_weight.py:L36-62 (export_lora_weight)

Signature

# Full transformer conversion
def convert_transformer(
    ckpt_path: str,
    num_layers: int,           # 30 for 2B, 42 for 5B
    num_attention_heads: int,  # 30 for 2B, 48 for 5B
    use_rotary_positional_embeddings: bool,  # False for 2B, True for 5B
    i2v: bool,                 # True for Image-to-Video models
    dtype: torch.dtype,        # torch.float16 or torch.bfloat16
    init_kwargs: Dict[str, Any],  # Version-specific patch/sample params
) -> CogVideoXTransformer3DModel

# VAE conversion
def convert_vae(
    ckpt_path: str,
    scaling_factor: float,     # 1.15258426 for 2B, 0.7 for 5B
    version: str,              # "1.0" or "1.5"
    dtype: torch.dtype,
) -> AutoencoderKLCogVideoX

# LoRA weight export
def export_lora_weight(
    ckpt_path: str,
    lora_save_directory: str,
) -> None

Import

# As a script (most common usage)
python tools/convert_weight_sat2hf.py --transformer_ckpt_path ... --output_path ...
python tools/export_sat_lora_weight.py --sat_pt_path ... --lora_save_directory ...

# As a module (for programmatic use)
from tools.convert_weight_sat2hf import convert_transformer, convert_vae
from tools.export_sat_lora_weight import export_lora_weight
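Both scripts begin by unwrapping the nested SAT checkpoint structure (step 2 of the conversion). A minimal sketch of that unwrapping, assuming the checkpoint has already been loaded via torch.load (the real script passes mmap=True to avoid materializing the full file in memory):

```python
def unwrap_state_dict(ckpt):
    """Descend through the wrapper keys ('module', 'model',
    'state_dict') that SAT checkpoints may place around the flat
    parameter dict, returning the innermost dict."""
    for key in ("module", "model", "state_dict"):
        if isinstance(ckpt, dict) and key in ckpt and isinstance(ckpt[key], dict):
            ckpt = ckpt[key]
    return ckpt
```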

I/O Contract

Inputs (convert_weight_sat2hf.py)

Parameter Type Required Description
--transformer_ckpt_path str No Path to SAT transformer checkpoint (.pt file). If omitted, transformer is not converted.
--vae_ckpt_path str No Path to SAT VAE checkpoint (.pt file). If omitted, VAE is not converted.
--output_path str Yes Directory path where the converted Diffusers pipeline will be saved.
--num_layers int No Number of transformer blocks. Default: 30 (for 2B). Use 42 for 5B.
--num_attention_heads int No Number of attention heads. Default: 30 (for 2B). Use 48 for 5B.
--use_rotary_positional_embeddings flag No Enable RoPE. Default: False (for 2B). Set for 5B.
--scaling_factor float No VAE scaling factor. Default: 1.15258426 (for 2B). Use 0.7 for 5B.
--snr_shift_scale float No SNR shift scale for scheduler. Default: 3.0 (for 2B). Use 1.0 for 5B.
--i2v flag No Convert as Image-to-Video model (uses 32 input channels instead of 16).
--version str No CogVideoX version: "1.0" or "1.5". Default: "1.0".
--fp16 flag No Save model weights in float16 precision.
--bf16 flag No Save model weights in bfloat16 precision.
--text_encoder_cache_dir str No Path to cached T5-XXL text encoder weights.
--push_to_hub flag No Push converted model to HuggingFace Hub after saving.
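The size-dependent flag values from the table above can be collected in one place. This preset table is a hypothetical convenience (the actual script takes each value as a separate CLI flag), but the numbers mirror the documented defaults:

```python
# Hypothetical presets mirroring the per-size defaults documented
# above; the conversion script itself exposes these as CLI flags.
MODEL_PRESETS = {
    "2b": dict(
        num_layers=30,
        num_attention_heads=30,
        use_rotary_positional_embeddings=False,
        scaling_factor=1.15258426,
        snr_shift_scale=3.0,
    ),
    "5b": dict(
        num_layers=42,
        num_attention_heads=48,
        use_rotary_positional_embeddings=True,
        scaling_factor=0.7,
        snr_shift_scale=1.0,
    ),
}
```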

Inputs (export_sat_lora_weight.py)

Parameter Type Required Description
--sat_pt_path str Yes Path to SAT checkpoint containing LoRA weights.
--lora_save_directory str Yes Directory path where pytorch_lora_weights.safetensors will be saved.

Outputs

Output Type Description
Diffusers pipeline directory Directory Complete HuggingFace-format pipeline with model_index.json, transformer, VAE, tokenizer, text_encoder, and scheduler subdirectories. Files use safe serialization with 5GB shard limit.
LoRA weights file File pytorch_lora_weights.safetensors in the specified save directory. Compatible with pipe.load_lora_weights().
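A hypothetical post-conversion sanity check (not part of the conversion scripts) can confirm the saved directory has the expected Diffusers pipeline shape before deployment:

```python
import json
import os

EXPECTED_COMPONENTS = {"transformer", "vae", "tokenizer", "text_encoder", "scheduler"}

def check_pipeline_dir(output_path):
    """Verify that model_index.json exists and registers the
    component subdirectories a CogVideoX pipeline is expected to
    contain. Returns the sorted component names on success."""
    index_path = os.path.join(output_path, "model_index.json")
    if not os.path.isfile(index_path):
        raise FileNotFoundError("missing model_index.json")
    with open(index_path) as f:
        index = json.load(f)
    missing = EXPECTED_COMPONENTS - set(index)
    if missing:
        raise ValueError(f"components missing from model_index.json: {missing}")
    return sorted(EXPECTED_COMPONENTS)
```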

Usage Examples

Convert Full CogVideoX-2B Transformer

python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path ckpts/transformer/1000/mp_rank_00_model_states.pt \
    --output_path output/cogvideox-2b-finetuned \
    --num_layers 30 \
    --num_attention_heads 30 \
    --fp16 \
    --text_encoder_cache_dir cache/t5-xxl

Convert CogVideoX-5B I2V Model

python tools/convert_weight_sat2hf.py \
    --transformer_ckpt_path ckpts/transformer/1000/mp_rank_00_model_states.pt \
    --output_path output/cogvideox-5b-i2v-finetuned \
    --num_layers 42 \
    --num_attention_heads 48 \
    --use_rotary_positional_embeddings \
    --i2v \
    --bf16 \
    --scaling_factor 0.7 \
    --snr_shift_scale 1.0

Export LoRA Weights

python tools/export_sat_lora_weight.py \
    --sat_pt_path ckpts_lora/transformer/1000/mp_rank_00_model_states.pt \
    --lora_save_directory output/lora_weights

Load Converted LoRA Weights in Diffusers

from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b")
pipe.load_lora_weights("output/lora_weights")

External Dependencies

  • diffusers: Provides CogVideoXTransformer3DModel, AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXImageToVideoPipeline, CogVideoXDDIMScheduler, and LoraBaseMixin.
  • transformers: Provides T5Tokenizer and T5EncoderModel for text encoding.
  • torch: Checkpoint loading, tensor manipulation, and dtype conversion.
  • safetensors: Safe serialization format for model weights.
