Principle: zai-org CogVideo Text-to-Video Generation
Overview
Technique for generating video frames from a text description using iterative denoising of latent representations conditioned on text embeddings.
Description
Text-to-video generation uses the full diffusion pipeline to transform a text prompt into a sequence of video frames. The process involves multiple stages:
- Text encoding -- The text prompt is tokenized and encoded into conditioning embeddings using the T5 text encoder
- Latent initialization -- Random noise is sampled in the latent space with dimensions determined by the target resolution and frame count
- Iterative denoising -- The transformer model iteratively refines the noisy latents over multiple timesteps, guided by the text conditioning and classifier-free guidance
- Dynamic CFG -- The guidance scale is varied during sampling for improved temporal coherence and visual quality
- VAE decoding -- The final denoised latents are decoded by the VAE into pixel-space video frames
Generation is controlled by several key parameters: the number of inference steps (quality/speed tradeoff), the guidance scale (strength of prompt adherence), and the random seed (reproducibility).
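The stages above can be sketched schematically. Everything here is a toy stand-in (the real text encoder, transformer, and VAE are large neural networks), and all shapes are illustrative rather than the model's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed controls reproducibility

def encode_text(prompt):
    """Stand-in for the T5 encoder: prompt -> conditioning embedding."""
    return rng.standard_normal(16)

def denoise(latents, t, cond):
    """Stand-in for one transformer denoising step."""
    return latents * 0.95  # pretend each step removes a little noise

def decode(latents):
    """Stand-in for the VAE decoder: latents -> pixel-space frames."""
    return np.clip(latents, -1.0, 1.0)

def generate(prompt, num_inference_steps=50, num_frames=49):
    cond = encode_text(prompt)                          # 1. text encoding
    latents = rng.standard_normal((num_frames, 8, 8))   # 2. latent initialization
    for t in range(num_inference_steps):                # 3. iterative denoising
        latents = denoise(latents, t, cond)
    return decode(latents)                              # 5. VAE decoding

frames = generate("a cat surfing a wave")
print(frames.shape)  # (49, 8, 8)
```

Step 4 (dynamic CFG) is omitted here for brevity; it modifies how each denoising step combines conditional and unconditional predictions, as described under Theoretical Basis.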
Usage
Use when generating video content from text descriptions. Key parameter recommendations:
| Parameter | Default | Guidance |
|---|---|---|
| num_inference_steps | 50 | Higher = better quality, slower. 30-50 is typical range. |
| guidance_scale | 6.0 | Higher = stronger prompt adherence. 6.0 is recommended default. |
| use_dynamic_cfg | True | Recommended for better temporal coherence. |
| num_frames | 49 (1.0) / 81 (1.5) | Must satisfy model constraints: 49 for CogVideoX-1.0, 81 for CogVideoX-1.5. |
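The frame-count constraint from the table can be enforced up front rather than failing deep inside the pipeline. This is a hypothetical helper; the version strings and mapping are assumptions for illustration:

```python
# Hypothetical validation helper; version keys are illustrative labels,
# not an official API.
SUPPORTED_FRAMES = {"1.0": 49, "1.5": 81}

def check_num_frames(num_frames: int, model_version: str) -> None:
    """Raise ValueError if num_frames violates the model's constraint."""
    expected = SUPPORTED_FRAMES.get(model_version)
    if expected is None:
        raise ValueError(f"unknown model version: {model_version}")
    if num_frames != expected:
        raise ValueError(
            f"CogVideoX-{model_version} expects {expected} frames, got {num_frames}"
        )

check_num_frames(49, "1.0")  # passes
check_num_frames(81, "1.5")  # passes
```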
Theoretical Basis
Classifier-Free Guidance
Classifier-free guidance (CFG) steers generation toward the text conditioning by combining conditional and unconditional predictions:
- epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)
Where:
- epsilon_uncond is the model prediction without text conditioning
- epsilon_cond is the model prediction with text conditioning
- scale is the guidance scale (higher values produce stronger adherence to the prompt)
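The combination rule is a one-line computation. A minimal sketch with toy noise predictions (NumPy, shapes illustrative):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.1, 0.2])  # toy prediction without text conditioning
eps_cond = np.array([0.3, 0.0])    # toy prediction with text conditioning

# scale = 1.0 reduces to the conditional prediction; scale > 1.0
# extrapolates past it, away from the unconditional prediction.
print(classifier_free_guidance(eps_uncond, eps_cond, 1.0))
print(classifier_free_guidance(eps_uncond, eps_cond, 6.0))
```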
Dynamic CFG
Dynamic CFG varies the guidance scale during the sampling process rather than using a fixed value. This typically uses a higher scale in early steps (for coarse structure) and a lower scale in later steps (for fine details and temporal coherence).
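One way to realize such a schedule is a cosine interpolation between a high and a low scale. This is an illustrative sketch of the idea; the exact schedule a given pipeline implements may differ:

```python
import math

def dynamic_cfg_scale(step, num_steps, scale_max=6.0, scale_min=1.0):
    """Illustrative cosine schedule: scale_max at early steps,
    easing down to scale_min at late steps."""
    progress = step / max(num_steps - 1, 1)          # 0.0 -> 1.0 over sampling
    weight = (1 + math.cos(math.pi * progress)) / 2  # 1.0 -> 0.0
    return scale_min + (scale_max - scale_min) * weight

# Strong guidance early (coarse structure), weaker guidance late
# (fine details and temporal coherence).
print(dynamic_cfg_scale(0, 50))   # 6.0
print(dynamic_cfg_scale(49, 50))  # 1.0
```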
Frame Count Constraints
The number of frames must satisfy model architecture constraints:
- CogVideoX-1.0 (2B, 5B): 49 frames
- CogVideoX-1.5 (1.5-5B): 81 frames
These values correspond to the temporal dimension that the 3D transformer was trained on.
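The specific values follow a pattern of the form 4k + 1. A sketch of the arithmetic, assuming the 3D VAE compresses the temporal dimension by a factor of 4 while treating the first frame separately (an assumption about the architecture for illustration):

```python
# Assumed temporal compression factor of the 3D VAE (illustrative).
TEMPORAL_COMPRESSION = 4

def latent_frames(num_frames: int) -> int:
    """Number of latent-space frames for a given pixel-space frame count."""
    assert (num_frames - 1) % TEMPORAL_COMPRESSION == 0, "invalid frame count"
    return (num_frames - 1) // TEMPORAL_COMPRESSION + 1

print(latent_frames(49))  # 13 latent frames (CogVideoX-1.0)
print(latent_frames(81))  # 21 latent frames (CogVideoX-1.5)
```

Under this assumption, frame counts that are not of the form 4k + 1 have no valid latent representation, which is why the pipeline constrains them.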