
Principle:Zai org CogVideo Text to Video Generation

From Leeroopedia



Overview

Technique for generating video frames from a text description using iterative denoising of latent representations conditioned on text embeddings.

Description

Text-to-video generation uses the full diffusion pipeline to transform a text prompt into a sequence of video frames. The process involves multiple stages:

  1. Text encoding -- The text prompt is tokenized and encoded into conditioning embeddings using the T5 text encoder
  2. Latent initialization -- Random noise is sampled in the latent space with dimensions determined by the target resolution and frame count
  3. Iterative denoising -- The transformer model iteratively refines the noisy latents over multiple timesteps, guided by the text conditioning and classifier-free guidance
  4. Dynamic CFG -- The guidance scale is varied during sampling for improved temporal coherence and visual quality
  5. VAE decoding -- The final denoised latents are decoded by the VAE into pixel-space video frames

The generation process is controlled by several key parameters: the number of inference steps (the quality/speed tradeoff), the guidance scale (the strength of prompt adherence), and the random seed (reproducibility).
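A minimal sketch of how these parameters map onto a generation call. The pipeline class, model id, and argument names below are taken from the Hugging Face diffusers library's CogVideoX support, not from this page; adapt them to your actual inference stack.

```python
# Sketch of the generation parameters described above, assuming the
# Hugging Face diffusers CogVideoXPipeline API.

def default_generation_kwargs(prompt: str) -> dict:
    """Collect the key knobs: steps (quality/speed), guidance scale
    (prompt adherence), dynamic CFG, and frame count."""
    return {
        "prompt": prompt,
        "num_inference_steps": 50,   # quality/speed tradeoff
        "guidance_scale": 6.0,       # prompt adherence
        "use_dynamic_cfg": True,     # vary the scale during sampling
        "num_frames": 49,            # CogVideoX-1.0 constraint
    }

# The actual call (requires a GPU and a model download) would look like:
#   import torch
#   from diffusers import CogVideoXPipeline
#   from diffusers.utils import export_to_video
#   pipe = CogVideoXPipeline.from_pretrained(
#       "THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")
#   frames = pipe(**default_generation_kwargs("A panda playing guitar"),
#                 generator=torch.Generator("cpu").manual_seed(42)).frames[0]
#   export_to_video(frames, "output.mp4", fps=8)
```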

Usage

Use when generating video content from text descriptions. Key parameter recommendations:

  Parameter            Default               Guidance
  num_inference_steps  50                    Higher = better quality, slower; 30-50 is the typical range.
  guidance_scale       6.0                   Higher = stronger prompt adherence; 6.0 is the recommended default.
  use_dynamic_cfg      True                  Recommended for better temporal coherence.
  num_frames           49 (1.0) / 81 (1.5)   Must satisfy model constraints: 49 for CogVideoX-1.0, 81 for CogVideoX-1.5.
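Because an unsupported frame count fails only deep inside the model, it can help to validate it up front. A small helper, using the version-to-frame-count mapping from the table above (the version strings are illustrative labels, not official checkpoint names):

```python
# Frame counts per model family, as listed in the table above.
SUPPORTED_FRAMES = {
    "CogVideoX-1.0": 49,  # 2B and 5B checkpoints
    "CogVideoX-1.5": 81,
}

def check_num_frames(model_version: str, num_frames: int) -> None:
    """Raise early if num_frames violates the model's architecture constraint."""
    expected = SUPPORTED_FRAMES[model_version]
    if num_frames != expected:
        raise ValueError(
            f"{model_version} expects num_frames={expected}, got {num_frames}"
        )
```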

Theoretical Basis

Classifier-Free Guidance

Classifier-free guidance (CFG) steers generation toward the text conditioning by combining conditional and unconditional predictions:

epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)

Where:

  • epsilon_uncond is the model prediction without text conditioning
  • epsilon_cond is the model prediction with text conditioning
  • scale is the guidance scale (higher values produce stronger adherence to the prompt)
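The CFG combination above is a single vectorized expression; a sketch in NumPy:

```python
import numpy as np

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray,
                scale: float) -> np.ndarray:
    # epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)
    # scale = 0 recovers the unconditional prediction, scale = 1 the
    # conditional one; scale > 1 extrapolates past it toward the prompt.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```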

Dynamic CFG

Dynamic CFG varies the guidance scale during the sampling process rather than using a fixed value. This typically uses a higher scale in early steps (for coarse structure) and a lower scale in later steps (for fine details and temporal coherence).
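One illustrative schedule with that shape is a cosine decay from a high to a low scale over the denoising steps; the exact schedule a given implementation uses may differ, so treat this as a sketch:

```python
import math

def dynamic_cfg_scale(step: int, num_steps: int,
                      scale_max: float = 6.0, scale_min: float = 1.0) -> float:
    """Cosine decay: scale_max at step 0 (coarse structure),
    scale_min at the final step (fine details, temporal coherence)."""
    progress = step / max(num_steps - 1, 1)  # 0.0 -> 1.0 over sampling
    return scale_min + (scale_max - scale_min) * 0.5 * (1 + math.cos(math.pi * progress))
```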

Frame Count Constraints

The number of frames must satisfy model architecture constraints:

  • CogVideoX-1.0 (2B, 5B): 49 frames
  • CogVideoX-1.5 (1.5-5B): 81 frames

These values correspond to the temporal dimension that the 3D transformer was trained on.
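Under the assumption that the CogVideoX 3D VAE compresses the temporal axis 4x while handling the first frame separately, these pixel-space frame counts map onto integer latent lengths, which is why they are the supported values:

```python
def latent_temporal_dim(num_frames: int, temporal_compression: int = 4) -> int:
    # Assumption: the VAE keeps the first frame and compresses the rest 4x,
    # so T_latent = (T - 1) // compression + 1. Both supported frame counts
    # divide cleanly under this rule.
    return (num_frames - 1) // temporal_compression + 1
```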

