# Workflow: Zai org CogVideo Diffusers Image-to-Video Inference
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Inference, Image_to_Video |
| Last Updated | 2026-02-10 12:00 GMT |
## Overview
End-to-end process for generating videos from a static image and text prompt using CogVideoX image-to-video models via the Hugging Face Diffusers library.
## Description
This workflow covers the procedure for image-to-video (I2V) generation, where a source image is animated into a video clip guided by a text prompt. It uses the CogVideoXImageToVideoPipeline from Diffusers, which conditions the diffusion process on both the input image and text description. The image is encoded and concatenated with noise latents as the starting point for denoising. This enables controlled video generation where the first frame matches the input image and subsequent frames follow the motion described in the prompt.
## Usage
Execute this workflow when you have a static image that you want to animate into a video. The text prompt guides the type of motion and scene evolution. This is useful for creating product animations, artistic video content from illustrations, or extending single photographs into dynamic scenes. Requires a CogVideoX I2V model variant (CogVideoX-5B-I2V or CogVideoX1.5-5B-I2V).
## Execution Steps
### Step 1: Model Loading
Load the CogVideoXImageToVideoPipeline from a pre-trained I2V model checkpoint. This pipeline includes the CogVideoX transformer with image conditioning support, the T5-XXL text encoder, and the 3D VAE. The I2V transformer architecture differs from T2V in that it accepts concatenated image and noise latents as input.
Key considerations:
- Must use an I2V-specific model: CogVideoX-5B-I2V or CogVideoX1.5-5B-I2V
- T2V models cannot be used for I2V (different architecture)
- Weights are loaded in bfloat16 precision
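A minimal loading sketch, assuming the CogVideoX-5B-I2V checkpoint is fetched from the Hugging Face Hub under the `THUDM/CogVideoX-5b-I2V` repository ID (adjust to the variant and namespace you actually use):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# I2V-specific checkpoint; a T2V checkpoint will not work here
# because the transformer expects concatenated image + noise latents.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",    # assumed Hub repo ID
    torch_dtype=torch.bfloat16,  # weights are loaded in bfloat16
)
```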
### Step 2: Image and Prompt Preparation
Load the source image and prepare the text prompt. The image is loaded and resized to match the model's expected resolution. For CogVideoX-5B-I2V the resolution is 480x720; for CogVideoX1.5-5B-I2V, custom resolutions up to 768x1360 are supported. The text prompt describes the desired motion and scene evolution.
Key considerations:
- CogVideoX1.5-5B-I2V supports user-defined width and height
- CogVideoX-5B-I2V uses fixed 480x720 resolution
- Image aspect ratio should match target video dimensions
- Prompt should describe motion rather than static scene content
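The preparation step can be sketched as below. In practice the source image would come from `diffusers.utils.load_image`; here a synthetic PIL image stands in so the snippet is self-contained. The `snap` helper is our own illustration (not a Diffusers API) that rounds dimensions down to a multiple of 16, a convenient granularity for the VAE's latent grid when choosing custom sizes for CogVideoX1.5-5B-I2V:

```python
from PIL import Image

def snap(dim: int, multiple: int = 16) -> int:
    """Round a dimension down to the nearest multiple (latent-grid friendly)."""
    return max(multiple, (dim // multiple) * multiple)

# In practice: image = diffusers.utils.load_image("input.png")
image = Image.new("RGB", (1366, 770))  # stand-in for the real source image

# CogVideoX-5B-I2V: resize to the fixed 480x720.
# CogVideoX1.5-5B-I2V: custom sizes, e.g. snapped to multiples of 16:
width, height = snap(image.width), snap(image.height)
image = image.resize((width, height))

# Describe motion and scene evolution, not just static content.
prompt = "Waves roll gently toward the shore as clouds drift across the sky."
```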
### Step 3: Scheduler and Memory Configuration
Configure the DPM scheduler with trailing timestep spacing and apply memory optimizations. Enable sequential CPU offloading for minimal VRAM usage, and activate VAE slicing and tiling for efficient frame decoding.
Key considerations:
- Same scheduler and memory optimization options as T2V inference
- DPM scheduler is recommended for I2V models
- VAE tiling is especially important for CogVideoX1.5 higher resolution output
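A configuration sketch, assuming `pipe` is the pipeline loaded in Step 1:

```python
from diffusers import CogVideoXDPMScheduler

# DPM scheduler with trailing timestep spacing, as recommended for I2V.
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

pipe.enable_sequential_cpu_offload()  # minimal VRAM usage (slower inference)
pipe.vae.enable_slicing()             # decode frames in slices
pipe.vae.enable_tiling()              # tile spatial decode; key for CogVideoX1.5's larger outputs
```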
### Step 4: Conditioned Video Generation
Run the I2V pipeline with the image and text prompt. The image is encoded to latent space, noise is added according to the diffusion schedule, and the transformer denoises the combined image-noise latent conditioned on the text embedding. The result is a video where the first frame closely resembles the input image.
Key considerations:
- `use_dynamic_cfg=True` is recommended for DPM scheduler
- The image conditions the generation to maintain visual consistency
- `num_frames` determines output length (49 or 81 frames depending on model)
- Guidance scale controls the balance between prompt adherence and visual quality
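A generation sketch, assuming `pipe`, `image`, and `prompt` were prepared in the earlier steps; the step count, guidance scale, and seed are illustrative values, not prescribed defaults:

```python
import torch

video = pipe(
    image=image,                 # first frame will closely resemble this
    prompt=prompt,
    num_frames=49,               # 49 for CogVideoX, 81 for CogVideoX1.5
    num_inference_steps=50,
    guidance_scale=6.0,          # prompt adherence vs. visual quality
    use_dynamic_cfg=True,        # recommended with the DPM scheduler
    generator=torch.Generator().manual_seed(42),  # reproducibility
).frames[0]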
### Step 5: Video Export
Export the generated video frames to an MP4 file at the configured frame rate.
Key considerations:
- Output resolution matches the model variant's configuration
- Frame rate is typically 16 fps for CogVideoX1.5, 8 fps for CogVideoX
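An export sketch using Diffusers' `export_to_video` utility, where `video` is the frame list from Step 4 and the output path is arbitrary:

```python
from diffusers.utils import export_to_video

fps = 8  # CogVideoX; use 16 for CogVideoX1.5 variants
export_to_video(video, "output.mp4", fps=fps)
```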