# Workflow: Zai org CogVideo Diffusers Image-to-Video Inference
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Inference, Image_to_Video |
| Last Updated | 2026-02-10 12:00 GMT |
## Overview
End-to-end process for generating videos from a static image and text prompt using CogVideoX image-to-video models via the Hugging Face Diffusers library.
## Description
This workflow covers the procedure for image-to-video (I2V) generation, where a source image is animated into a video clip guided by a text prompt. It uses the CogVideoXImageToVideoPipeline from Diffusers, which conditions the diffusion process on both the input image and text description. The image is encoded and concatenated with noise latents as the starting point for denoising. This enables controlled video generation where the first frame matches the input image and subsequent frames follow the motion described in the prompt.
## Usage
Execute this workflow when you have a static image that you want to animate into a video. The text prompt guides the type of motion and scene evolution. This is useful for creating product animations, artistic video content from illustrations, or extending single photographs into dynamic scenes. Requires a CogVideoX I2V model variant (CogVideoX-5B-I2V or CogVideoX1.5-5B-I2V).
## Execution Steps
### Step 1: Model Loading
Load the CogVideoXImageToVideoPipeline from a pre-trained I2V model checkpoint. This pipeline includes the CogVideoX transformer with image conditioning support, the T5-XXL text encoder, and the 3D VAE. The I2V transformer architecture differs from T2V in that it accepts concatenated image and noise latents as input.
Key considerations:
- Must use an I2V-specific model: CogVideoX-5B-I2V or CogVideoX1.5-5B-I2V
- T2V models cannot be used for I2V (different architecture)
- Weights are loaded in bfloat16 precision
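A minimal loading sketch, assuming the CogVideoX-5B-I2V checkpoint is fetched from the Hugging Face Hub under the `THUDM/CogVideoX-5b-I2V` repository ID (adjust to the variant and namespace you actually use):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# I2V-specific checkpoint; a T2V checkpoint will not work here
# because the transformer expects concatenated image + noise latents.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",    # assumed Hub repo ID
    torch_dtype=torch.bfloat16,  # weights are loaded in bfloat16
)
```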
### Step 2: Image and Prompt Preparation
Load the source image and prepare the text prompt. The image is loaded and resized to match the model's expected resolution. For CogVideoX-5B-I2V the resolution is 480x720; for CogVideoX1.5-5B-I2V, custom resolutions up to 768x1360 are supported. The text prompt describes the desired motion and scene evolution.
Key considerations:
- CogVideoX1.5-5B-I2V supports user-defined width and height
- CogVideoX-5B-I2V uses fixed 480x720 resolution
- Image aspect ratio should match target video dimensions
- Prompt should describe motion rather than static scene content
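The preparation step can be sketched as below. In practice the source image would come from `diffusers.utils.load_image`; here a synthetic PIL image stands in so the snippet is self-contained. The `snap` helper is our own illustration (not a Diffusers API) that rounds dimensions down to a multiple of 16, a convenient granularity for the VAE's latent grid when choosing custom sizes for CogVideoX1.5-5B-I2V:

```python
from PIL import Image

def snap(dim: int, multiple: int = 16) -> int:
    """Round a dimension down to the nearest multiple (latent-grid friendly)."""
    return max(multiple, (dim // multiple) * multiple)

# In practice: image = diffusers.utils.load_image("input.png")
image = Image.new("RGB", (1366, 770))  # stand-in for the real source image

# CogVideoX-5B-I2V: resize to the fixed 480x720.
# CogVideoX1.5-5B-I2V: custom sizes, e.g. snapped to multiples of 16:
width, height = snap(image.width), snap(image.height)
image = image.resize((width, height))

# Describe motion and scene evolution, not just static content.
prompt = "Waves roll gently toward the shore as clouds drift across the sky."
```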
### Step 3: Scheduler and Memory Configuration
Configure the DPM scheduler with trailing timestep spacing and apply memory optimizations. Enable sequential CPU offloading for minimal VRAM usage, and activate VAE slicing and tiling for efficient frame decoding.
Key considerations:
- Same scheduler and memory optimization options as T2V inference
- DPM scheduler is recommended for I2V models
- VAE tiling is especially important for CogVideoX1.5 higher resolution output
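A configuration sketch, assuming `pipe` is the pipeline loaded in Step 1:

```python
from diffusers import CogVideoXDPMScheduler

# DPM scheduler with trailing timestep spacing, as recommended for I2V.
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

pipe.enable_sequential_cpu_offload()  # minimal VRAM usage (slower inference)
pipe.vae.enable_slicing()             # decode frames in slices
pipe.vae.enable_tiling()              # tile spatial decode; key for CogVideoX1.5's larger outputs
```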
### Step 4: Conditioned Video Generation
Run the I2V pipeline with the image and text prompt. The image is encoded to latent space, noise is added according to the diffusion schedule, and the transformer denoises the combined image-noise latent conditioned on the text embedding. The result is a video where the first frame closely resembles the input image.
Key considerations:
- `use_dynamic_cfg=True` is recommended for DPM scheduler
- The image conditions the generation to maintain visual consistency
- `num_frames` determines output length (49 or 81 frames depending on model)
- Guidance scale controls the balance between prompt adherence and visual quality
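A generation sketch, assuming `pipe`, `image`, and `prompt` were prepared in the earlier steps; the step count, guidance scale, and seed are illustrative values, not prescribed defaults:

```python
import torch

video = pipe(
    image=image,                 # first frame will closely resemble this
    prompt=prompt,
    num_frames=49,               # 49 for CogVideoX, 81 for CogVideoX1.5
    num_inference_steps=50,
    guidance_scale=6.0,          # prompt adherence vs. visual quality
    use_dynamic_cfg=True,        # recommended with the DPM scheduler
    generator=torch.Generator().manual_seed(42),  # reproducibility
).frames[0]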
### Step 5: Video Export
Export the generated video frames to an MP4 file at the configured frame rate.
Key considerations:
- Output resolution matches the model variant's configuration
- Frame rate is typically 16 fps for CogVideoX1.5, 8 fps for CogVideoX
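An export sketch using Diffusers' `export_to_video` utility, where `video` is the frame list from Step 4 and the output path is arbitrary:

```python
from diffusers.utils import export_to_video

fps = 8  # CogVideoX; use 16 for CogVideoX1.5 variants
export_to_video(video, "output.mp4", fps=fps)
```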