Workflow:Zai org CogVideo SAT Video Generation

Knowledge Sources	CogVideo SAT Inference Guide SwissArmyTransformer
Domains	Video_Generation, Inference, SAT_Framework
Last Updated	2026-02-10 12:00 GMT

Overview

End-to-end process for generating videos from text or image prompts using CogVideoX models via the SwissArmyTransformer (SAT) inference pipeline.

Description

This workflow covers video generation using the SAT framework's native inference pipeline. It loads the CogVideoX model architecture directly through SAT's model loading system, configures the sampling parameters via YAML configs, and generates videos using the full SGM diffusion sampling pipeline. The SAT inference path provides direct access to all sampling strategies (EulerEDM, DPMPP2M, etc.) and supports both text-to-video and image-to-video generation. Prompts can be provided interactively via CLI or in batch from a text file.

Usage

Execute this workflow when you want to generate videos using SAT-format model weights (the native format of CogVideoX), need access to advanced sampling parameters not exposed by the Diffusers pipeline, or are working within the SAT ecosystem. This path is also used for evaluating SAT-trained fine-tuned models before converting weights to HuggingFace format.

Execution Steps

Step 1: Environment and Configuration

Set up the SAT environment and select the model and inference configuration files. The model YAML config defines the architecture and weight paths, while the inference YAML config specifies sampling parameters (number of steps, guidance scale, output dimensions). For single-GPU inference, set the CUDA environment variables (for example, CUDA_VISIBLE_DEVICES) so that only the intended device is visible.

Key considerations:

Model config selects the variant: `cogvideox_2b.yaml`, `cogvideox_5b.yaml`, `cogvideox1.5_5b.yaml`, etc.
Inference config (`inference.yaml`) sets num_steps, guidance scale, and output paths
I2V configs are separate (e.g., `cogvideox_5b_i2v.yaml`)
LoRA configs are available for fine-tuned model inference
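
The layering of a model config and an inference config can be pictured as a recursive dictionary merge in which later files override earlier ones. The sketch below is only an illustration of that idea using the standard library: `merge_configs` is a hypothetical helper, the keys and values in `model_cfg` and `inference_cfg` are made up, and the real pipeline may load and combine its YAML files differently.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; later values win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars, lists, or new keys: the override replaces the base value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for the parsed contents of a model config and inference.yaml.
model_cfg = {"model": {"variant": "cogvideox_5b"}, "sampling": {"num_steps": 50}}
inference_cfg = {"sampling": {"guidance_scale": 6.0}, "output_dir": "outputs/"}

config = merge_configs(model_cfg, inference_cfg)
print(config)
```

Merging into a copy keeps the individual config dictionaries unchanged, which makes it easier to reuse one model config with several inference configs.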

Step 2: Model Loading

Initialize the SATVideoDiffusionEngine and load pre-trained weights from a checkpoint. The engine instantiates the DiffusionTransformer backbone, the 3D VAE (first stage model), the T5 text conditioner, and all SGM components (sampler, denoiser, discretizer, guider). Weights are loaded via SAT's `load_checkpoint` utility.

Key considerations:

Model architecture is instantiated from the YAML config specification
Checkpoint loading supports both full and LoRA-augmented weights
The T5 encoder is loaded with the maximum sequence length set in the model config (226 tokens for CogVideoX, 224 tokens for CogVideoX1.5)
All components are moved to GPU in the appropriate precision
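
Building components from a YAML specification usually follows a "target plus params" pattern: each entry names a class by its dotted import path and lists its constructor arguments. The sketch below is a minimal re-implementation of that pattern for illustration, not the SAT or SGM source. The `target` and `params` key names are assumptions about the config layout, and a standard-library class stands in for a real component such as the sampler or VAE.

```python
import importlib


def get_obj_from_str(path: str):
    """Resolve a dotted path such as 'package.module.ClassName' to the object."""
    module_name, attr = path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), attr)


def instantiate_from_config(config: dict):
    """Build an object from a {'target': ..., 'params': {...}} mapping."""
    if "target" not in config:
        raise KeyError("config needs a 'target' entry")
    cls = get_obj_from_str(config["target"])
    return cls(**config.get("params", {}))


# Stand-in component: a standard-library class plays the role of a model part.
demo_cfg = {"target": "datetime.timedelta", "params": {"seconds": 90}}
component = instantiate_from_config(demo_cfg)
print(type(component).__name__, component.total_seconds())
```

With this pattern, swapping a sampler or guider becomes a config change rather than a code change, which is why the architecture can be described entirely in YAML.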

Step 3: Prompt Input

Provide text prompts for video generation. Prompts can be entered interactively via the command line or loaded from a text file for batch generation. For I2V generation, an image path is also specified in the configuration. The prompt is encoded by the T5-XXL conditioner into embedding vectors that guide the diffusion process.

Key considerations:

CLI mode: interactive input one prompt at a time
File mode: batch processing from a text file (e.g., `configs/test.txt`)
Prompts are distributed across GPUs in multi-GPU setups
Negative prompts can be specified for classifier-free guidance
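
File-mode input with multi-GPU distribution can be sketched as a round-robin split of the non-empty lines across ranks. The function name `read_prompts`, its `rank` and `world_size` parameters, and the example prompts are hypothetical. The actual script may assign prompts differently.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_prompts(
    lines: Iterable[str], rank: int = 0, world_size: int = 1
) -> Iterator[Tuple[int, str]]:
    """Yield (index, prompt) pairs assigned to `rank`, round-robin over non-empty lines."""
    index = 0
    for line in lines:
        prompt = line.strip()
        if not prompt:
            continue  # blank lines do not count as prompts
        if index % world_size == rank:
            yield index, prompt
        index += 1


# In-memory stand-in for a prompt file such as configs/test.txt.
prompt_file = io.StringIO(
    "a cat surfing\n\na city at night\na drone shot of a forest\n"
)
all_lines = prompt_file.readlines()

rank0 = list(read_prompts(all_lines, rank=0, world_size=2))
rank1 = list(read_prompts(all_lines, rank=1, world_size=2))
print(rank0)
print(rank1)
```

Keeping the global index alongside each prompt lets every rank write its output under a name that does not collide with the other ranks.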

Step 4: Diffusion Sampling

Generate video latents through the iterative denoising process. The sampler (EulerEDM by default) progressively denoises random Gaussian noise conditioned on the text embeddings. The denoiser applies preconditioning scaling, the guider implements classifier-free guidance, and the discretizer maps between continuous and discrete sigma schedules.
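
How the sampler, guider, and sigma schedule interact can be shown with a scalar toy example. The sketch below assumes an EDM-style (Karras) noise schedule and a plain Euler update, and it omits the denoiser's preconditioning. All function names, the default schedule parameters, and the toy denoiser are illustrative assumptions rather than the pipeline's own code, which operates on latent tensors.

```python
from typing import Callable, List

Denoiser = Callable[[float, float, bool], float]


def edm_sigmas(
    num_steps: int, sigma_min: float = 0.002, sigma_max: float = 80.0, rho: float = 7.0
) -> List[float]:
    """EDM-style noise levels from sigma_max down to sigma_min, followed by 0."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    ramp = [i / (num_steps - 1) for i in range(num_steps)]
    return [(hi + t * (lo - hi)) ** rho for t in ramp] + [0.0]


def guided_denoise(denoiser: Denoiser, x: float, sigma: float, scale: float) -> float:
    """Classifier-free guidance: push the unconditional estimate toward the conditional one."""
    cond = denoiser(x, sigma, True)
    uncond = denoiser(x, sigma, False)
    return uncond + scale * (cond - uncond)


def euler_sample(denoiser: Denoiser, x: float, sigmas: List[float], scale: float) -> float:
    """Euler steps along the schedule, moving x toward the guided estimate."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = guided_denoise(denoiser, x, sigma, scale)
        d = (x - denoised) / sigma  # derivative dx/dsigma at the current noise level
        x = x + d * (sigma_next - sigma)
    return x


def toy_denoiser(x: float, sigma: float, conditioned: bool) -> float:
    """Toy model: always predicts 1.0 with the prompt and 0.0 without it."""
    return 1.0 if conditioned else 0.0


sigmas = edm_sigmas(num_steps=10)
sample = euler_sample(toy_denoiser, x=0.5 * sigmas[0], sigmas=sigmas, scale=6.0)
print(sample)
```

Because the toy denoiser is constant, the final value lands on the guided estimate, which makes the effect of the guidance scale easy to see: a larger scale moves the result further from the unconditional prediction.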

Key considerations:

Default sampler uses EulerEDM with configurable number of steps
Classifier-free guidance scale controls prompt adherence
The sampling process operates in the compressed latent space of the 3D VAE
Image conditioning (for I2V) concatenates image latents with noise
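
The size of the latent tensor the sampler works on can be estimated from the output dimensions. The sketch below assumes 4x temporal compression, 8x spatial compression, and 16 latent channels, and it assumes that I2V concatenation happens along the channel dimension. These values and the helper `latent_shape` are illustrative assumptions, so check the model config for the actual numbers.

```python
from typing import Tuple


def latent_shape(
    num_frames: int,
    height: int,
    width: int,
    temporal_factor: int = 4,
    spatial_factor: int = 8,
    latent_channels: int = 16,
    image_conditioned: bool = False,
) -> Tuple[int, int, int, int]:
    """Return (latent_frames, channels, latent_height, latent_width)."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by the spatial factor")
    # The first frame is kept; the remaining frames are compressed in time.
    latent_frames = (num_frames - 1) // temporal_factor + 1
    # Assumed I2V layout: image latents are stacked with the noise latents as extra channels.
    channels = latent_channels * 2 if image_conditioned else latent_channels
    return latent_frames, channels, height // spatial_factor, width // spatial_factor


t2v = latent_shape(49, 480, 720)
i2v = latent_shape(49, 480, 720, image_conditioned=True)
print(t2v, i2v)
```

Under these assumptions, a 49-frame 480x720 clip is sampled as a much smaller latent volume, which is what makes iterative denoising tractable.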

Step 5: Video Decoding and Export

Decode the generated latent tensor back to pixel space using the 3D VAE decoder, then export as an MP4 video file. The decoder reconstructs full-resolution video frames from the compressed latent representation. Videos are saved with unique filenames based on the generation index.

Key considerations:

VAE decoding uses context parallelism for long videos
Output frames are clipped to the [-1, 1] range and rescaled to [0, 255]
Videos are saved as MP4 using imageio at the configured FPS
Wandb logging can record generated videos for experiment tracking
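
The pixel rescaling and index-based naming can be sketched on plain Python values. The helpers `to_uint8` and `video_path`, the six-digit filename pattern, and the choice to truncate rather than round are assumptions for illustration. The real pipeline applies the same idea to whole frame tensors before writing the MP4.

```python
import os
from typing import List


def to_uint8(value: float) -> int:
    """Clip a decoded pixel to [-1, 1], then map it linearly onto 0..255."""
    clipped = max(-1.0, min(1.0, value))
    return int((clipped + 1.0) / 2.0 * 255.0)  # truncation toward zero (assumed)


def video_path(output_dir: str, index: int) -> str:
    """Build a unique output filename from the generation index (hypothetical pattern)."""
    return os.path.join(output_dir, f"{index:06d}.mp4")


frame: List[float] = [-1.5, -1.0, 0.0, 1.0, 2.0]
pixels = [to_uint8(v) for v in frame]
path = video_path("outputs", 3)
print(pixels, path)
```

Clipping before rescaling matters because the decoder can overshoot the nominal range slightly, and unclipped values would otherwise wrap around or saturate unpredictably when cast to 8-bit integers.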
