Implementation: Zai org CogVideo SAT Convert Weight
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | CogVideo |
| Domains | Model_Conversion, Deployment |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tooling for converting SAT checkpoint weights to HuggingFace Diffusers format, provided by the CogVideo tools module. Supports both full transformer/VAE conversion and LoRA adapter weight export.
Description
The CogVideo repository provides two conversion scripts:
convert_weight_sat2hf.py
Converts the complete SAT transformer and/or VAE to a Diffusers-compatible pipeline directory. The conversion process:
- Loads the SAT checkpoint from disk using `torch.load` with memory mapping.
- Extracts the model state dict (handling nested `model`, `module`, or `state_dict` keys).
- Applies two stages of key remapping: string replacement via `TRANSFORMER_KEYS_RENAME_DICT`, followed by special handlers via `TRANSFORMER_SPECIAL_KEYS_REMAP` (QKV splitting, AdaLN chunking, layernorm remapping, unused key removal).
- Constructs a fresh `CogVideoXTransformer3DModel` with version-appropriate parameters and loads the converted state dict with strict matching.
- Optionally converts the VAE using similar key remapping.
- Assembles a complete `CogVideoXPipeline` (or `CogVideoXImageToVideoPipeline` for I2V) with the converted components, a T5-XXL tokenizer/encoder, and a `CogVideoXDDIMScheduler`.
- Saves the pipeline with safe serialization and a 5GB shard limit.
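The two-stage key remapping can be sketched as follows. This is an illustrative reconstruction only: the actual rename entries and special handlers live in `tools/convert_weight_sat2hf.py`, and the dictionary contents shown here are hypothetical placeholders, not the real mappings.

```python
# Hypothetical sketch of the two-stage key remapping. The real
# TRANSFORMER_KEYS_RENAME_DICT and TRANSFORMER_SPECIAL_KEYS_REMAP in
# tools/convert_weight_sat2hf.py contain different, fuller entries.

TRANSFORMER_KEYS_RENAME_DICT = {
    "transformer.final_layernorm": "norm_final",  # longer keys first
    "transformer": "transformer_blocks",
    "attention": "attn1",
    "mlp": "ff.net",
}

def remove_keys_inplace(key, state_dict):
    # Special handler: drop keys the Diffusers model does not use.
    state_dict.pop(key)

TRANSFORMER_SPECIAL_KEYS_REMAP = {
    "position_embedding": remove_keys_inplace,
}

def convert_state_dict(sat_state_dict):
    converted = {}
    # Stage 1: plain substring replacement on every key.
    for key, value in sat_state_dict.items():
        new_key = key
        for old, new in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(old, new)
        converted[new_key] = value
    # Stage 2: special handlers fire on keys matching their pattern
    # and may mutate the dict (split QKV, chunk AdaLN, remove keys).
    for key in list(converted.keys()):
        for pattern, handler in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
            if pattern in key:
                handler(key, converted)
    return converted
```

Under these placeholder mappings, a SAT key such as `transformer.layers.0.attention.dense.weight` would become `transformer_blocks.layers.0.attn1.dense.weight`, and `position_embedding` would be dropped.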
export_sat_lora_weight.py
Exports only the LoRA adapter weights from a SAT checkpoint into PEFT-compatible safetensors format:
- Loads the SAT checkpoint and extracts the state dict.
- Filters keys for LoRA parameters (`matrix_A` and `matrix_B`) using the `LORA_KEYS_RENAME` mapping.
- Remaps SAT LoRA naming to Diffusers PEFT naming (e.g., `matrix_A.0` to `lora_A.weight` for the Q projection).
- Validates exactly 240 LoRA parameters (for the CogVideoX-2B model with 30 layers).
- Writes the LoRA state dict using `LoraBaseMixin.write_lora_layers` with safe serialization.
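The filter-and-rename step above can be sketched as below. The mapping entries are hypothetical stand-ins for the real `LORA_KEYS_RENAME` in `tools/export_sat_lora_weight.py`; the general shape (SAT's indexed `matrix_A.0/.1/.2` slots versus PEFT's per-projection `lora_A`/`lora_B` pairs) is what the sketch illustrates. Note the 240-parameter check is consistent with 30 layers × 4 projections (q, k, v, dense) × 2 factors (A and B).

```python
# Illustrative sketch only; the concrete LORA_KEYS_RENAME mapping in
# tools/export_sat_lora_weight.py may use different key patterns.
LORA_KEYS_RENAME = {
    "matrix_A.0": "to_q.lora_A.weight",
    "matrix_A.1": "to_k.lora_A.weight",
    "matrix_A.2": "to_v.lora_A.weight",
    "matrix_B.0": "to_q.lora_B.weight",
    "matrix_B.1": "to_k.lora_B.weight",
    "matrix_B.2": "to_v.lora_B.weight",
}

def export_lora_state_dict(sat_state_dict):
    lora_sd = {}
    for key, value in sat_state_dict.items():
        if "matrix_A" not in key and "matrix_B" not in key:
            continue  # keep LoRA factors only; drop base model weights
        new_key = key
        for old, new in LORA_KEYS_RENAME.items():
            new_key = new_key.replace(old, new)
        lora_sd[new_key] = value
    # The real script then validates len(lora_sd) == 240 for CogVideoX-2B:
    # 30 layers x 4 projections x 2 factors.
    return lora_sd
```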
Usage
Use these scripts after SAT training completes to convert checkpoints for Diffusers-based inference or HuggingFace Hub deployment.
Code Reference
Source Location
- `tools/convert_weight_sat2hf.py:L153-191` (convert_transformer)
- `tools/convert_weight_sat2hf.py:L194-215` (convert_vae)
- `tools/convert_weight_sat2hf.py:L326-403` (main block)
- `tools/export_sat_lora_weight.py:L36-62` (export_lora_weight)
Signature
# Full transformer conversion
def convert_transformer(
ckpt_path: str,
num_layers: int, # 30 for 2B, 42 for 5B
num_attention_heads: int, # 30 for 2B, 48 for 5B
use_rotary_positional_embeddings: bool, # False for 2B, True for 5B
i2v: bool, # True for Image-to-Video models
dtype: torch.dtype, # torch.float16 or torch.bfloat16
init_kwargs: Dict[str, Any], # Version-specific patch/sample params
) -> CogVideoXTransformer3DModel
# VAE conversion
def convert_vae(
ckpt_path: str,
scaling_factor: float, # 1.15258426 for 2B, 0.7 for 5B
version: str, # "1.0" or "1.5"
dtype: torch.dtype,
) -> AutoencoderKLCogVideoX
# LoRA weight export
def export_lora_weight(
ckpt_path: str,
lora_save_directory: str,
) -> None
Import
# As a script (most common usage)
python tools/convert_weight_sat2hf.py --transformer_ckpt_path ... --output_path ...
python tools/export_sat_lora_weight.py --sat_pt_path ... --lora_save_directory ...
# As a module (for programmatic use)
from tools.convert_weight_sat2hf import convert_transformer, convert_vae
from tools.export_sat_lora_weight import export_lora_weight
I/O Contract
Inputs (convert_weight_sat2hf.py)
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--transformer_ckpt_path` | str | No | Path to SAT transformer checkpoint (.pt file). If omitted, transformer is not converted. |
| `--vae_ckpt_path` | str | No | Path to SAT VAE checkpoint (.pt file). If omitted, VAE is not converted. |
| `--output_path` | str | Yes | Directory path where the converted Diffusers pipeline will be saved. |
| `--num_layers` | int | No | Number of transformer blocks. Default: 30 (for 2B). Use 42 for 5B. |
| `--num_attention_heads` | int | No | Number of attention heads. Default: 30 (for 2B). Use 48 for 5B. |
| `--use_rotary_positional_embeddings` | flag | No | Enable RoPE. Default: False (for 2B). Set for 5B. |
| `--scaling_factor` | float | No | VAE scaling factor. Default: 1.15258426 (for 2B). Use 0.7 for 5B. |
| `--snr_shift_scale` | float | No | SNR shift scale for scheduler. Default: 3.0 (for 2B). Use 1.0 for 5B. |
| `--i2v` | flag | No | Convert as Image-to-Video model (uses 32 input channels instead of 16). |
| `--version` | str | No | CogVideoX version: "1.0" or "1.5". Default: "1.0". |
| `--fp16` | flag | No | Save model weights in float16 precision. |
| `--bf16` | flag | No | Save model weights in bfloat16 precision. |
| `--text_encoder_cache_dir` | str | No | Path to cached T5-XXL text encoder weights. |
| `--push_to_hub` | flag | No | Push converted model to HuggingFace Hub after saving. |
Inputs (export_sat_lora_weight.py)
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--sat_pt_path` | str | Yes | Path to SAT checkpoint containing LoRA weights. |
| `--lora_save_directory` | str | Yes | Directory path where `pytorch_lora_weights.safetensors` will be saved. |
Outputs
| Output | Type | Description |
|---|---|---|
| Diffusers pipeline directory | Directory | Complete HuggingFace-format pipeline with `model_index.json`, transformer, VAE, tokenizer, text_encoder, and scheduler subdirectories. Files use safe serialization with a 5GB shard limit. |
| LoRA weights file | File | `pytorch_lora_weights.safetensors` in the specified save directory. Compatible with `pipe.load_lora_weights()`. |
Usage Examples
Convert Full CogVideoX-2B Transformer
python tools/convert_weight_sat2hf.py \
--transformer_ckpt_path ckpts/transformer/1000/mp_rank_00_model_states.pt \
--output_path output/cogvideox-2b-finetuned \
--num_layers 30 \
--num_attention_heads 30 \
--fp16 \
--text_encoder_cache_dir cache/t5-xxl
Convert CogVideoX-5B I2V Model
python tools/convert_weight_sat2hf.py \
--transformer_ckpt_path ckpts/transformer/1000/mp_rank_00_model_states.pt \
--output_path output/cogvideox-5b-i2v-finetuned \
--num_layers 42 \
--num_attention_heads 48 \
--use_rotary_positional_embeddings \
--i2v \
--bf16 \
--scaling_factor 0.7 \
--snr_shift_scale 1.0
Export LoRA Weights
python tools/export_sat_lora_weight.py \
--sat_pt_path ckpts_lora/transformer/1000/mp_rank_00_model_states.pt \
--lora_save_directory output/lora_weights
Load Converted LoRA Weights in Diffusers
from diffusers import CogVideoXPipeline
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b")
pipe.load_lora_weights("output/lora_weights")
External Dependencies
- diffusers: Provides `CogVideoXTransformer3DModel`, `AutoencoderKLCogVideoX`, `CogVideoXPipeline`, `CogVideoXImageToVideoPipeline`, `CogVideoXDDIMScheduler`, and `LoraBaseMixin`.
- transformers: Provides `T5Tokenizer` and `T5EncoderModel` for text encoding.
- torch: Checkpoint loading, tensor manipulation, and dtype conversion.
- safetensors: Safe serialization format for model weights.