Principle:Zai org CogVideo Image to Video Generation

From Leeroopedia


Metadata

Page Type: Principle
Knowledge Sources: Repo (CogVideo), Paper (CogVideoX)
Domains: Video_Generation, Diffusion_Models, Image_Conditioning
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for generating video frames conditioned on both a text prompt and a reference image using iterative latent denoising.

Description

Image-to-video generation extends text-to-video by incorporating an input image as visual conditioning. The image is encoded into latent space by the VAE and concatenated with random noise latents. The transformer then denoises the combined representation while maintaining visual consistency with the input image. The text prompt guides the semantic content and motion of the generated video.

Generation Process

The I2V generation process proceeds through the following stages:

  1. Image encoding: The conditioning image is encoded by the VAE into a latent representation z_img.
  2. Noise initialization: Random Gaussian noise is sampled as the initial video latent z_T, with shape determined by the target height, width, and number of frames.
  3. Latent concatenation: The image latent z_img is concatenated with the noisy video latent z_t along the channel dimension at each denoising step.
  4. Text encoding: The text prompt is encoded into a sequence of text embeddings by the text encoder.
  5. Iterative denoising: The transformer processes the concatenated latent with text conditioning for N inference steps, progressively removing noise.
  6. Latent decoding: The denoised video latent is decoded by the VAE into a sequence of pixel-space video frames.
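The six stages above can be sketched as a toy loop. This is an illustrative NumPy skeleton, not the real model: the shapes are placeholders and `denoise_step` is a hypothetical stand-in for the diffusion transformer and scheduler update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative latent shape: (frames, channels, height, width).
F, C, H, W = 13, 16, 60, 90

# 1. Image encoding: stand-in for the VAE latent of the conditioning
#    image, repeated across the temporal axis.
z_img = np.broadcast_to(rng.normal(size=(1, C, H, W)), (F, C, H, W))

# 2. Noise initialization: the video latent starts as pure Gaussian noise.
z_t = rng.normal(size=(F, C, H, W))

# 4. Text encoding: stand-in for the text-encoder output (seq_len, dim).
text_emb = rng.normal(size=(226, 4096))

def denoise_step(x, t, cond):
    """Hypothetical stand-in for the transformer's noise prediction."""
    return x * 0.01  # pretend this is the predicted noise

# 5. Iterative denoising over N steps.
num_inference_steps = 50
for t in range(num_inference_steps):
    # 3. Latent concatenation along the channel dimension at each step.
    model_input = np.concatenate([z_t, z_img], axis=1)  # (F, 2C, H, W)
    noise_pred = denoise_step(model_input, t, text_emb)[:, :C]
    z_t = z_t - noise_pred  # schematic update; real schedulers differ

# 6. The denoised latent z_t would then be decoded by the VAE into frames.
print(z_t.shape)  # (13, 16, 60, 90)
```

Note that the concatenation doubles the channel count of the transformer's input while its noise prediction keeps the channel count of z_t.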

Key Parameters

  • num_frames: Controls the number of video frames generated (default 81, corresponding to approximately 5 seconds at 16 fps).
  • num_inference_steps: Controls the number of denoising iterations (default 50). More steps generally produce higher quality but increase computation time.
  • guidance_scale: Controls the strength of classifier-free guidance (default 6.0). Higher values increase prompt adherence but may reduce diversity and naturalness.
  • use_dynamic_cfg: When enabled with the DPM scheduler, applies a dynamic classifier-free guidance schedule that varies the guidance strength across denoising steps for improved quality.
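Collected together, the stated defaults form a small settings mapping (parameter names follow the list above; only the values stated in the text are used):

```python
# Default I2V generation settings as described above.
i2v_defaults = {
    "num_frames": 81,            # ~5 s at 16 fps
    "num_inference_steps": 50,   # denoising iterations
    "guidance_scale": 6.0,       # classifier-free guidance strength
    "use_dynamic_cfg": False,    # enable with the DPM scheduler
}

fps = 16
duration_s = i2v_defaults["num_frames"] / fps
print(f"{duration_s:.2f} s")  # 5.06 s
```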

Usage

Use when you want to animate a still image based on a text description. The generated video will maintain visual consistency with the input image. The I2V pipeline call requires:

  • A configured I2V pipeline (with scheduler and memory settings applied).
  • A text prompt describing the desired video content and motion.
  • A conditioning image (as a PIL Image object).
  • Generation parameters specifying resolution, frame count, and quality settings.
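Under the diffusers implementation of CogVideoX, the four requirements above map onto a pipeline call roughly as follows. This is a sketch, not a verified script: the model ID, memory settings, and default parameter values may differ across releases, so check the repository for current values.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Configured I2V pipeline (model ID assumed; verify against the repo).
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # memory setting for limited-VRAM GPUs

image = load_image("input.png")  # conditioning image (PIL Image)
prompt = "A sailboat drifting across a calm lake at sunset"

video = pipe(
    prompt=prompt,
    image=image,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=16)
```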

Theoretical Basis

Image-Conditioned Denoising

The I2V model concatenates image latents z_img with noisy video latents z_t along channels:

input = [z_t; z_img]

The transformer processes this joint representation with text conditioning c:

epsilon_theta([z_t; z_img], t, c)

where epsilon_theta is the noise prediction network, t is the current timestep, and c is the text embedding. The model learns to predict the noise component while preserving the structural information encoded in z_img.
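The channel-wise interface can be made concrete with a shape-only sketch. Here epsilon_theta is a hypothetical placeholder; the only point is that it consumes the 2C-channel concatenation [z_t; z_img] and emits a C-channel noise estimate matching z_t.

```python
import numpy as np

C = 16  # latent channel count (assumed; the actual value depends on the VAE)

def epsilon_theta(x, t, c):
    """Hypothetical noise-prediction network: takes the 2C-channel input
    [z_t; z_img] and returns a C-channel noise estimate shaped like z_t."""
    assert x.shape[1] == 2 * C
    return x[:, :C] * 0.0  # placeholder output

z_t = np.zeros((8, C, 60, 90))    # noisy video latent (frames first)
z_img = np.zeros((8, C, 60, 90))  # image latent, repeated over frames

eps = epsilon_theta(np.concatenate([z_t, z_img], axis=1), t=10, c=None)
print(eps.shape)  # (8, 16, 60, 90)
```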

Classifier-Free Guidance

Classifier-free guidance (CFG) applies to both text and image conditions. During training, the text condition is randomly dropped with some probability, enabling the model to generate both conditionally and unconditionally. At inference, the final noise prediction is computed as:

epsilon = epsilon_uncond + guidance_scale * (epsilon_cond - epsilon_uncond)

where epsilon_uncond is the prediction with empty text conditioning and epsilon_cond is the prediction with the actual text prompt. The image conditioning is always present (not dropped) during I2V inference, ensuring visual consistency is maintained regardless of the guidance scale.
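The guidance formula above is a one-line extrapolation and can be written directly; the toy vectors below are illustrative values, not model outputs:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.1, 0.2, 0.3])  # prediction with empty text prompt
eps_c = np.array([0.3, 0.2, 0.1])  # prediction with the actual prompt

print(cfg_combine(eps_u, eps_c, 1.0))  # [0.3 0.2 0.1]
print(cfg_combine(eps_u, eps_c, 6.0))  # [ 1.3  0.2 -0.9]
```

A guidance scale of 1.0 recovers the conditional prediction unchanged; the default of 6.0 pushes the prediction further along the conditional-minus-unconditional direction.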

Dynamic CFG

When use_dynamic_cfg=True, the guidance scale is varied across denoising steps rather than held constant. This allows stronger guidance in the early (high-noise) steps where global structure is determined, and weaker guidance in later (low-noise) steps where fine details are refined. This improves overall sample quality and reduces artifacts compared to a fixed guidance scale.
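A schedule with the behavior described (strong guidance early, tapering off late) can be sketched as a cosine ramp. This is an illustrative schedule matching the description above, not the exact formula CogVideoX uses:

```python
import math

def dynamic_cfg_schedule(guidance_scale, num_steps):
    """Illustrative dynamic CFG schedule (not the exact CogVideoX formula):
    a cosine ramp from full guidance at the first (high-noise) step down
    toward 1.0 (no extra guidance) at the last (low-noise) step."""
    scales = []
    for i in range(num_steps):
        w = (1 + math.cos(math.pi * i / (num_steps - 1))) / 2  # 1 -> 0
        scales.append(1.0 + (guidance_scale - 1.0) * w)
    return scales

schedule = dynamic_cfg_schedule(guidance_scale=6.0, num_steps=50)
print(round(schedule[0], 2), round(schedule[-1], 2))  # 6.0 1.0
```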

Resolution and Frame Count

The model generates video at the resolution specified by height and width, with the number of temporal frames controlled by num_frames. The default of 81 frames at 16 fps yields approximately 5 seconds of video. For CogVideoX1.5-5B-I2V, custom resolutions are supported, allowing generation at resolutions different from the default 768x1360.
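The relationship between output size and latent size can be computed directly, assuming the commonly cited CogVideoX VAE compression factors (8x spatial, 4x temporal) and 16 latent channels; these factors are assumptions for illustration, not repo-verified constants:

```python
def latent_shape(height, width, num_frames,
                 spatial_ratio=8, temporal_ratio=4, latent_channels=16):
    """Latent dimensions for a given output size, under the assumed
    8x spatial / 4x temporal VAE compression."""
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return (latent_frames, latent_channels,
            height // spatial_ratio, width // spatial_ratio)

# Default CogVideoX1.5-5B-I2V output size from the text above.
print(latent_shape(768, 1360, 81))  # (21, 16, 96, 170)
```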
