Workflow:Zai org CogVideo SAT Video Generation
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Inference, SAT_Framework |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for generating videos from text or image prompts using CogVideoX models via the SwissArmyTransformer (SAT) inference pipeline.
Description
This workflow covers video generation using the SAT framework's native inference pipeline. It loads the CogVideoX model architecture directly through SAT's model loading system, configures the sampling parameters via YAML configs, and generates videos using the full SGM diffusion sampling pipeline. The SAT inference path provides direct access to all sampling strategies (EulerEDM, DPMPP2M, etc.) and supports both text-to-video and image-to-video generation. Prompts can be provided interactively via CLI or in batch from a text file.
Usage
Use this workflow when you want to generate videos from SAT-format model weights (the native format of CogVideoX), need access to advanced sampling parameters that the Diffusers pipeline does not expose, or are working within the SAT ecosystem. This path is also used to evaluate models fine-tuned with SAT before their weights are converted to HuggingFace format.
Execution Steps
Step 1: Environment and Configuration
Set up the SAT environment and select the model and inference configuration files. The model YAML config defines the architecture and weight paths, while the inference YAML config specifies sampling parameters (number of steps, guidance scale, output dimensions). Set CUDA environment variables for single-GPU inference.
Key considerations:
- Model config selects the variant: cogvideox_2b.yaml, cogvideox_5b.yaml, cogvideox1.5_5b.yaml, etc.
- Inference config (`inference.yaml`) sets num_steps, guidance scale, and output paths
- I2V configs are separate (e.g., cogvideox_5b_i2v.yaml)
- LoRA configs are available for fine-tuned model inference
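As a minimal sketch of the single-GPU setup described in this step, the snippet below builds the environment for one process on one GPU. `CUDA_VISIBLE_DEVICES` is the standard CUDA variable; `WORLD_SIZE`, `RANK`, `LOCAL_RANK`, and `LOCAL_WORLD_SIZE` are the usual distributed-launch variables, and it is an assumption here that the SAT scripts read them. `single_gpu_env` is a hypothetical helper, not part of the repository.

```python
import os


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment configured for one process on one GPU.

    Hypothetical helper: the distributed variable names below are an
    assumption about what the SAT launch scripts expect.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Expose only the selected GPU to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # Describe a "distributed" job that consists of a single process.
    env.update(
        {
            "WORLD_SIZE": "1",
            "RANK": "0",
            "LOCAL_RANK": "0",
            "LOCAL_WORLD_SIZE": "1",
        }
    )
    return env


if __name__ == "__main__":
    env = single_gpu_env(0)
    print({key: env[key] for key in ("CUDA_VISIBLE_DEVICES", "WORLD_SIZE", "RANK")})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the sampling script, which leaves the parent shell's environment untouched.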
Step 2: Model Loading
Initialize the SATVideoDiffusionEngine and load pre-trained weights from a checkpoint. The engine instantiates the DiffusionTransformer backbone, the 3D VAE (first stage model), the T5 text conditioner, and all SGM components (sampler, denoiser, discretizer, guider). Weights are loaded via SAT's `load_checkpoint` utility.
Key considerations:
- Model architecture is instantiated from the YAML config specification
- Checkpoint loading supports both full and LoRA-augmented weights
- The T5 encoder is loaded with a maximum sequence length of 224 or 226 tokens, depending on the model variant
- All components are moved to GPU in the appropriate precision
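To illustrate how an architecture can be instantiated from a config specification, here is a minimal sketch of the `target`/`params` convention, in which each component names a dotted class path and its constructor arguments. That the CogVideoX YAML files follow exactly this layout is an assumption, and the helper below is a simplified stand-in rather than the repository's own loader. A standard-library class is used as the target so the example runs without any model code.

```python
import importlib


def get_obj_from_str(path):
    """Resolve a dotted path such as 'package.module.ClassName' to the object."""
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def instantiate_from_config(config):
    """Build an object from a mapping with a 'target' path and optional 'params'."""
    if "target" not in config:
        raise KeyError("config needs a 'target' entry")
    cls = get_obj_from_str(config["target"])
    return cls(**config.get("params", {}))


if __name__ == "__main__":
    # Stand-in config: a real model config would name a network class instead.
    cfg = {"target": "fractions.Fraction", "params": {"numerator": 1, "denominator": 2}}
    print(instantiate_from_config(cfg))
```

The appeal of this pattern is that swapping a sampler, guider, or conditioner only requires editing the config, not the loading code.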
Step 3: Prompt Input
Provide text prompts for video generation. Prompts can be entered interactively via the command line or loaded from a text file for batch generation. For I2V generation, an image path is also specified in the configuration. The prompt is encoded by the T5-XXL conditioner into embedding vectors that guide the diffusion process.
Key considerations:
- CLI mode: prompts are entered interactively, one at a time
- File mode: batch processing from a text file (e.g., `configs/test.txt`)
- Prompts are distributed across GPUs in multi-GPU setups
- Negative prompts can be specified for classifier-free guidance
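The file mode and multi-GPU distribution described above can be sketched as follows. The one-prompt-per-line file layout and the round-robin split by rank are assumptions made for illustration; `read_prompts` and `shard_for_rank` are hypothetical helpers, and the actual pipeline may partition prompts differently.

```python
def read_prompts(path):
    """Yield non-empty prompts, one per line, from a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            prompt = line.strip()
            if prompt:
                yield prompt


def shard_for_rank(prompts, rank, world_size):
    """Round-robin split: rank r takes items r, r + world_size, and so on."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return [p for i, p in enumerate(prompts) if i % world_size == rank]


if __name__ == "__main__":
    demo = ["a cat surfing", "a city at night", "a paper boat", "falling snow"]
    for rank in range(2):
        print(rank, shard_for_rank(demo, rank, world_size=2))
```

With a single GPU (`world_size=1`), the shard is simply the full prompt list, so the same code path serves both setups.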
Step 4: Diffusion Sampling
Generate video latents through the iterative denoising process. The sampler (EulerEDM by default) progressively denoises random Gaussian noise conditioned on the text embeddings. The denoiser applies preconditioning scaling, the guider implements classifier-free guidance, and the discretizer maps between continuous and discrete sigma schedules.
Key considerations:
- Default sampler uses EulerEDM with configurable number of steps
- Classifier-free guidance scale controls prompt adherence
- The sampling process operates in the compressed latent space of the 3D VAE
- Image conditioning (for I2V) concatenates image latents with noise
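The interaction between the guider and the sampler can be sketched with a toy example. This is a simplified illustration, not the SAT/SGM implementation: plain Python lists stand in for latent tensors, the update is a basic Euler step on the noise level sigma, and preconditioning and stochastic noise injection are omitted. All function names are hypothetical.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def euler_sample(x, sigmas, denoise, scale):
    """Integrate from sigmas[0] down to sigmas[-1] with plain Euler steps.

    `denoise(x, sigma)` must return (uncond_prediction, cond_prediction),
    each an estimate of the clean sample. Every sigma except the last
    must be non-zero.
    """
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        uncond, cond = denoise(x, sigma)
        denoised = cfg_combine(uncond, cond, scale)
        # Slope of the sample with respect to sigma at the current point.
        d = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Step to the next, lower noise level.
        x = [xi + di * (sigma_next - sigma) for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    target = [0.5, -0.25]

    def toy_denoiser(x, sigma):
        # Toy model that always predicts the same clean sample.
        return target, target

    print(euler_sample([3.0, -2.0], [10.0, 5.0, 1.0, 0.0], toy_denoiser, scale=6.0))
```

A guidance scale of 1 reduces `cfg_combine` to the conditional prediction alone; larger values push the result further from the unconditional prediction, which is what increases prompt adherence.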
Step 5: Video Decoding and Export
Decode the generated latent tensor back to pixel space using the 3D VAE decoder, then export as an MP4 video file. The decoder reconstructs full-resolution video frames from the compressed latent representation. Videos are saved with unique filenames based on the generation index.
Key considerations:
- VAE decoding uses context parallelism for long videos
- Output frames are clipped to the [-1, 1] range and rescaled to [0, 255]
- Videos are saved as MP4 using imageio at the configured FPS
- Wandb logging can record generated videos for experiment tracking
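The clipping and rescaling of output frames mentioned above amounts to a simple affine map. The sketch below shows it on nested Python lists to stay dependency-free; a real pipeline would apply the same arithmetic to whole tensors. Rounding to the nearest integer is a choice made here for illustration (an implementation may truncate instead), and both function names are hypothetical.

```python
def to_uint8(value):
    """Map one sample from [-1, 1] to an integer in [0, 255].

    Out-of-range input is clipped first, so the result always fits in a byte.
    """
    clipped = max(-1.0, min(1.0, value))
    return int(round((clipped + 1.0) / 2.0 * 255.0))


def frame_to_uint8(frame):
    """Apply to_uint8 to every value of a frame stored as rows of samples."""
    return [[to_uint8(v) for v in row] for row in frame]


if __name__ == "__main__":
    decoded = [[-1.2, -1.0, 0.0], [0.5, 1.0, 1.7]]
    print(frame_to_uint8(decoded))
```

Clipping before rescaling matters because decoder outputs can overshoot the nominal range slightly, and unclipped values would otherwise wrap around or saturate unpredictably when cast to 8-bit integers.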