Implementation: Zai org CogVideo SAT Convert Weight
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | CogVideo |
| Domains | Model_Conversion, Deployment |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tooling for converting SAT checkpoint weights to HuggingFace Diffusers format, provided by the CogVideo tools module. Supports both full transformer/VAE conversion and LoRA adapter weight export.
Description
The CogVideo repository provides two conversion scripts:
convert_weight_sat2hf.py
Converts the complete SAT transformer and/or VAE to a Diffusers-compatible pipeline directory. The conversion process:
- Loads the SAT checkpoint from disk using `torch.load` with memory mapping.
- Extracts the model state dict (handling nested `model`, `module`, or `state_dict` keys).
- Applies two stages of key remapping: string replacement via `TRANSFORMER_KEYS_RENAME_DICT`, followed by special handlers via `TRANSFORMER_SPECIAL_KEYS_REMAP` (QKV splitting, AdaLN chunking, layernorm remapping, unused key removal).
- Constructs a fresh `CogVideoXTransformer3DModel` with version-appropriate parameters and loads the converted state dict with strict matching.
- Optionally converts the VAE using similar key remapping.
- Assembles a complete `CogVideoXPipeline` (or `CogVideoXImageToVideoPipeline` for I2V) with the converted components, a T5-XXL tokenizer/encoder, and a `CogVideoXDDIMScheduler`.
- Saves the pipeline with safe serialization and a 5GB shard limit.
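The two-stage key remapping can be sketched as follows. This is an illustrative reconstruction only: the actual rename entries and special handlers live in `tools/convert_weight_sat2hf.py`, and the dictionary contents shown here are hypothetical placeholders, not the real mappings.

```python
# Hypothetical sketch of the two-stage key remapping. The real
# TRANSFORMER_KEYS_RENAME_DICT and TRANSFORMER_SPECIAL_KEYS_REMAP in
# tools/convert_weight_sat2hf.py contain different, fuller entries.

TRANSFORMER_KEYS_RENAME_DICT = {
    "transformer.final_layernorm": "norm_final",  # longer keys first
    "transformer": "transformer_blocks",
    "attention": "attn1",
    "mlp": "ff.net",
}

def remove_keys_inplace(key, state_dict):
    # Special handler: drop keys the Diffusers model does not use.
    state_dict.pop(key)

TRANSFORMER_SPECIAL_KEYS_REMAP = {
    "position_embedding": remove_keys_inplace,
}

def convert_state_dict(sat_state_dict):
    converted = {}
    # Stage 1: plain substring replacement on every key.
    for key, value in sat_state_dict.items():
        new_key = key
        for old, new in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(old, new)
        converted[new_key] = value
    # Stage 2: special handlers fire on keys matching their pattern
    # and may mutate the dict (split QKV, chunk AdaLN, remove keys).
    for key in list(converted.keys()):
        for pattern, handler in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
            if pattern in key:
                handler(key, converted)
    return converted
```

Under these placeholder mappings, a SAT key such as `transformer.layers.0.attention.dense.weight` would become `transformer_blocks.layers.0.attn1.dense.weight`, and `position_embedding` would be dropped.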
export_sat_lora_weight.py
Exports only the LoRA adapter weights from a SAT checkpoint into PEFT-compatible safetensors format:
- Loads the SAT checkpoint and extracts the state dict.
- Filters keys for LoRA parameters (`matrix_A` and `matrix_B`) using the `LORA_KEYS_RENAME` mapping.
- Remaps SAT LoRA naming to Diffusers PEFT naming (e.g., `matrix_A.0` to `lora_A.weight` for the Q projection).
- Validates exactly 240 LoRA parameters (for the CogVideoX-2B model with 30 layers).
- Writes the LoRA state dict using `LoraBaseMixin.write_lora_layers` with safe serialization.
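The filter-and-rename step above can be sketched as below. The mapping entries are hypothetical stand-ins for the real `LORA_KEYS_RENAME` in `tools/export_sat_lora_weight.py`; the general shape (SAT's indexed `matrix_A.0/.1/.2` slots versus PEFT's per-projection `lora_A`/`lora_B` pairs) is what the sketch illustrates. Note the 240-parameter check is consistent with 30 layers × 4 projections (q, k, v, dense) × 2 factors (A and B).

```python
# Illustrative sketch only; the concrete LORA_KEYS_RENAME mapping in
# tools/export_sat_lora_weight.py may use different key patterns.
LORA_KEYS_RENAME = {
    "matrix_A.0": "to_q.lora_A.weight",
    "matrix_A.1": "to_k.lora_A.weight",
    "matrix_A.2": "to_v.lora_A.weight",
    "matrix_B.0": "to_q.lora_B.weight",
    "matrix_B.1": "to_k.lora_B.weight",
    "matrix_B.2": "to_v.lora_B.weight",
}

def export_lora_state_dict(sat_state_dict):
    lora_sd = {}
    for key, value in sat_state_dict.items():
        if "matrix_A" not in key and "matrix_B" not in key:
            continue  # keep LoRA factors only; drop base model weights
        new_key = key
        for old, new in LORA_KEYS_RENAME.items():
            new_key = new_key.replace(old, new)
        lora_sd[new_key] = value
    # The real script then validates len(lora_sd) == 240 for CogVideoX-2B:
    # 30 layers x 4 projections x 2 factors.
    return lora_sd
```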
Usage
Use these scripts after SAT training completes to convert checkpoints for Diffusers-based inference or HuggingFace Hub deployment.
Code Reference
Source Location
- `tools/convert_weight_sat2hf.py:L153-191` (convert_transformer)
- `tools/convert_weight_sat2hf.py:L194-215` (convert_vae)
- `tools/convert_weight_sat2hf.py:L326-403` (main block)
- `tools/export_sat_lora_weight.py:L36-62` (export_lora_weight)
Signature
# Full transformer conversion
def convert_transformer(
ckpt_path: str,
num_layers: int, # 30 for 2B, 42 for 5B
num_attention_heads: int, # 30 for 2B, 48 for 5B
use_rotary_positional_embeddings: bool, # False for 2B, True for 5B
i2v: bool, # True for Image-to-Video models
dtype: torch.dtype, # torch.float16 or torch.bfloat16
init_kwargs: Dict[str, Any], # Version-specific patch/sample params
) -> CogVideoXTransformer3DModel
# VAE conversion
def convert_vae(
ckpt_path: str,
scaling_factor: float, # 1.15258426 for 2B, 0.7 for 5B
version: str, # "1.0" or "1.5"
dtype: torch.dtype,
) -> AutoencoderKLCogVideoX
# LoRA weight export
def export_lora_weight(
ckpt_path: str,
lora_save_directory: str,
) -> None
Import
# As a script (most common usage)
python tools/convert_weight_sat2hf.py --transformer_ckpt_path ... --output_path ...
python tools/export_sat_lora_weight.py --sat_pt_path ... --lora_save_directory ...
# As a module (for programmatic use)
from tools.convert_weight_sat2hf import convert_transformer, convert_vae
from tools.export_sat_lora_weight import export_lora_weight
I/O Contract
Inputs (convert_weight_sat2hf.py)
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--transformer_ckpt_path` | str | No | Path to SAT transformer checkpoint (.pt file). If omitted, transformer is not converted. |
| `--vae_ckpt_path` | str | No | Path to SAT VAE checkpoint (.pt file). If omitted, VAE is not converted. |
| `--output_path` | str | Yes | Directory path where the converted Diffusers pipeline will be saved. |
| `--num_layers` | int | No | Number of transformer blocks. Default: 30 (for 2B). Use 42 for 5B. |
| `--num_attention_heads` | int | No | Number of attention heads. Default: 30 (for 2B). Use 48 for 5B. |
| `--use_rotary_positional_embeddings` | flag | No | Enable RoPE. Default: False (for 2B). Set for 5B. |
| `--scaling_factor` | float | No | VAE scaling factor. Default: 1.15258426 (for 2B). Use 0.7 for 5B. |
| `--snr_shift_scale` | float | No | SNR shift scale for scheduler. Default: 3.0 (for 2B). Use 1.0 for 5B. |
| `--i2v` | flag | No | Convert as Image-to-Video model (uses 32 input channels instead of 16). |
| `--version` | str | No | CogVideoX version: "1.0" or "1.5". Default: "1.0". |
| `--fp16` | flag | No | Save model weights in float16 precision. |
| `--bf16` | flag | No | Save model weights in bfloat16 precision. |
| `--text_encoder_cache_dir` | str | No | Path to cached T5-XXL text encoder weights. |
| `--push_to_hub` | flag | No | Push converted model to HuggingFace Hub after saving. |
Inputs (export_sat_lora_weight.py)
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--sat_pt_path` | str | Yes | Path to SAT checkpoint containing LoRA weights. |
| `--lora_save_directory` | str | Yes | Directory path where `pytorch_lora_weights.safetensors` will be saved. |
Outputs
| Output | Type | Description |
|---|---|---|
| Diffusers pipeline directory | Directory | Complete HuggingFace-format pipeline with `model_index.json`, transformer, VAE, tokenizer, text_encoder, and scheduler subdirectories. Files use safe serialization with a 5GB shard limit. |
| LoRA weights file | File | `pytorch_lora_weights.safetensors` in the specified save directory. Compatible with `pipe.load_lora_weights()`. |
Usage Examples
Convert Full CogVideoX-2B Transformer
python tools/convert_weight_sat2hf.py \
--transformer_ckpt_path ckpts/transformer/1000/mp_rank_00_model_states.pt \
--output_path output/cogvideox-2b-finetuned \
--num_layers 30 \
--num_attention_heads 30 \
--fp16 \
--text_encoder_cache_dir cache/t5-xxl
Convert CogVideoX-5B I2V Model
python tools/convert_weight_sat2hf.py \
--transformer_ckpt_path ckpts/transformer/1000/mp_rank_00_model_states.pt \
--output_path output/cogvideox-5b-i2v-finetuned \
--num_layers 42 \
--num_attention_heads 48 \
--use_rotary_positional_embeddings \
--i2v \
--bf16 \
--scaling_factor 0.7 \
--snr_shift_scale 1.0
Export LoRA Weights
python tools/export_sat_lora_weight.py \
--sat_pt_path ckpts_lora/transformer/1000/mp_rank_00_model_states.pt \
--lora_save_directory output/lora_weights
Load Converted LoRA Weights in Diffusers
from diffusers import CogVideoXPipeline
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b")
pipe.load_lora_weights("output/lora_weights")
External Dependencies
- diffusers: Provides `CogVideoXTransformer3DModel`, `AutoencoderKLCogVideoX`, `CogVideoXPipeline`, `CogVideoXImageToVideoPipeline`, `CogVideoXDDIMScheduler`, and `LoraBaseMixin`.
- transformers: Provides `T5Tokenizer` and `T5EncoderModel` for text encoding.
- torch: Checkpoint loading, tensor manipulation, and dtype conversion.
- safetensors: Safe serialization format for model weights.