Principle: zai-org CogVideo Text-to-Video Generation
Overview
Technique for generating video frames from a text description using iterative denoising of latent representations conditioned on text embeddings.
Description
Text-to-video generation uses the full diffusion pipeline to transform a text prompt into a sequence of video frames. The process involves multiple stages:
- Text encoding -- The text prompt is tokenized and encoded into conditioning embeddings using the T5 text encoder
- Latent initialization -- Random noise is sampled in the latent space with dimensions determined by the target resolution and frame count
- Iterative denoising -- The transformer model iteratively refines the noisy latents over multiple timesteps, guided by the text conditioning and classifier-free guidance
- Dynamic CFG -- The guidance scale is varied during sampling for improved temporal coherence and visual quality
- VAE decoding -- The final denoised latents are decoded by the VAE into pixel-space video frames
Generation is controlled by several key parameters: the number of inference steps (quality/speed tradeoff), the guidance scale (strength of prompt adherence), and the random seed (reproducibility).
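The stages above can be sketched schematically. Everything here is a toy stand-in (the real text encoder, transformer, and VAE are large neural networks), and all shapes are illustrative rather than the model's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed controls reproducibility

def encode_text(prompt):
    """Stand-in for the T5 encoder: prompt -> conditioning embedding."""
    return rng.standard_normal(16)

def denoise(latents, t, cond):
    """Stand-in for one transformer denoising step."""
    return latents * 0.95  # pretend each step removes a little noise

def decode(latents):
    """Stand-in for the VAE decoder: latents -> pixel-space frames."""
    return np.clip(latents, -1.0, 1.0)

def generate(prompt, num_inference_steps=50, num_frames=49):
    cond = encode_text(prompt)                          # 1. text encoding
    latents = rng.standard_normal((num_frames, 8, 8))   # 2. latent initialization
    for t in range(num_inference_steps):                # 3. iterative denoising
        latents = denoise(latents, t, cond)
    return decode(latents)                              # 5. VAE decoding

frames = generate("a cat surfing a wave")
print(frames.shape)  # (49, 8, 8)
```

Step 4 (dynamic CFG) is omitted here for brevity; it modifies how each denoising step combines conditional and unconditional predictions, as described under Theoretical Basis.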
Usage
Use when generating video content from text descriptions. Key parameter recommendations:
| Parameter | Default | Guidance |
|---|---|---|
| num_inference_steps | 50 | Higher = better quality, slower. 30-50 is typical range. |
| guidance_scale | 6.0 | Higher = stronger prompt adherence. 6.0 is recommended default. |
| use_dynamic_cfg | True | Recommended for better temporal coherence. |
| num_frames | 49 (1.0) / 81 (1.5) | Must satisfy model constraints: 49 for CogVideoX-1.0, 81 for CogVideoX-1.5. |
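The frame-count constraint from the table can be enforced up front rather than failing deep inside the pipeline. This is a hypothetical helper; the version strings and mapping are assumptions for illustration:

```python
# Hypothetical validation helper; version keys are illustrative labels,
# not an official API.
SUPPORTED_FRAMES = {"1.0": 49, "1.5": 81}

def check_num_frames(num_frames: int, model_version: str) -> None:
    """Raise ValueError if num_frames violates the model's constraint."""
    expected = SUPPORTED_FRAMES.get(model_version)
    if expected is None:
        raise ValueError(f"unknown model version: {model_version}")
    if num_frames != expected:
        raise ValueError(
            f"CogVideoX-{model_version} expects {expected} frames, got {num_frames}"
        )

check_num_frames(49, "1.0")  # passes
check_num_frames(81, "1.5")  # passes
```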
Theoretical Basis
Classifier-Free Guidance
Classifier-free guidance (CFG) steers generation toward the text conditioning by combining conditional and unconditional predictions:
- epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)
Where:
- epsilon_uncond is the model prediction without text conditioning
- epsilon_cond is the model prediction with text conditioning
- scale is the guidance scale (higher values produce stronger adherence to the prompt)
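The combination rule is a one-line computation. A minimal sketch with toy noise predictions (NumPy, shapes illustrative):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.1, 0.2])  # toy prediction without text conditioning
eps_cond = np.array([0.3, 0.0])    # toy prediction with text conditioning

# scale = 1.0 reduces to the conditional prediction; scale > 1.0
# extrapolates past it, away from the unconditional prediction.
print(classifier_free_guidance(eps_uncond, eps_cond, 1.0))
print(classifier_free_guidance(eps_uncond, eps_cond, 6.0))
```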
Dynamic CFG
Dynamic CFG varies the guidance scale during the sampling process rather than using a fixed value. This typically uses a higher scale in early steps (for coarse structure) and a lower scale in later steps (for fine details and temporal coherence).
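One way to realize such a schedule is a cosine interpolation between a high and a low scale. This is an illustrative sketch of the idea; the exact schedule a given pipeline implements may differ:

```python
import math

def dynamic_cfg_scale(step, num_steps, scale_max=6.0, scale_min=1.0):
    """Illustrative cosine schedule: scale_max at early steps,
    easing down to scale_min at late steps."""
    progress = step / max(num_steps - 1, 1)          # 0.0 -> 1.0 over sampling
    weight = (1 + math.cos(math.pi * progress)) / 2  # 1.0 -> 0.0
    return scale_min + (scale_max - scale_min) * weight

# Strong guidance early (coarse structure), weaker guidance late
# (fine details and temporal coherence).
print(dynamic_cfg_scale(0, 50))   # 6.0
print(dynamic_cfg_scale(49, 50))  # 1.0
```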
Frame Count Constraints
The number of frames must satisfy model architecture constraints:
- CogVideoX-1.0 (2B, 5B): 49 frames
- CogVideoX-1.5 (1.5-5B): 81 frames
These values correspond to the temporal dimension that the 3D transformer was trained on.
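The specific values follow a pattern of the form 4k + 1. A sketch of the arithmetic, assuming the 3D VAE compresses the temporal dimension by a factor of 4 while treating the first frame separately (an assumption about the architecture for illustration):

```python
# Assumed temporal compression factor of the 3D VAE (illustrative).
TEMPORAL_COMPRESSION = 4

def latent_frames(num_frames: int) -> int:
    """Number of latent-space frames for a given pixel-space frame count."""
    assert (num_frames - 1) % TEMPORAL_COMPRESSION == 0, "invalid frame count"
    return (num_frames - 1) // TEMPORAL_COMPRESSION + 1

print(latent_frames(49))  # 13 latent frames (CogVideoX-1.0)
print(latent_frames(81))  # 21 latent frames (CogVideoX-1.5)
```

Under this assumption, frame counts that are not of the form 4k + 1 have no valid latent representation, which is why the pipeline constrains them.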