Implementation: zai-org/CogVideo CogVideoXImageToVideoPipeline Call
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for generating image-conditioned video using the CogVideoX I2V pipeline callable provided by the diffusers library.
Description
The I2V pipeline is called as a function (via __call__) with the text prompt, conditioning image, and generation parameters. Internally, the pipeline:
- Encodes the text prompt using the tokenizer and text encoder.
- Encodes the conditioning image using the VAE encoder.
- Initializes random noise latents for the video.
- Concatenates the image latent with the noise latent along the channel dimension.
- Iteratively denoises the combined latent using the transformer and scheduler for the specified number of inference steps.
- Decodes the final denoised latent using the VAE decoder to produce pixel-space video frames.
The output is a CogVideoXPipelineOutput object whose frames attribute contains a list of generated videos, where each video is a list of PIL Image objects (one per frame). Access the first (and typically only) video via output.frames[0].
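The nesting of the frames attribute can be illustrated with plain Python. The string frames below are hypothetical stand-ins for the PIL Image objects the real pipeline returns; the indexing and the duration arithmetic are what matter:

```python
# Stand-in for CogVideoXPipelineOutput.frames: a list of generated
# videos, where each video is a list of per-frame objects (PIL Images
# in the real pipeline; placeholder strings here).
frames_attr = [[f"frame_{i:03d}" for i in range(81)]]  # one video, 81 frames

video = frames_attr[0]        # first (and typically only) generated video
num_frames = len(video)       # 81
duration_s = num_frames / 16  # ~5.06 s at 16 fps
```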
Usage
Call the configured I2V pipeline with a prompt, image, and generation parameters. The pipeline must have been loaded and configured (scheduler + memory) before calling.
Code Reference
Source Location
inference/cli_demo.py, lines 157-169.
Signature
output = pipe(
prompt, # str: text description of desired video
image=image, # PIL.Image.Image: conditioning image
height=height, # int: 480 for 5B-I2V, custom for 1.5-5B-I2V
width=width, # int: 720 for 5B-I2V, custom for 1.5-5B-I2V
num_frames=num_frames, # int: number of frames (default 81)
num_inference_steps=num_inference_steps, # int: denoising steps (default 50)
guidance_scale=guidance_scale, # float: CFG scale (default 6.0)
use_dynamic_cfg=True, # bool: dynamic guidance for DPM scheduler
num_videos_per_prompt=num_videos_per_prompt, # int: videos per prompt (default 1)
generator=torch.Generator().manual_seed(seed), # torch.Generator: for reproducibility
)
# Returns: CogVideoXPipelineOutput
# Access frames: output.frames[0] -> List[PIL.Image.Image]
Import
from diffusers import CogVideoXImageToVideoPipeline
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | Text description of the desired video content and motion. |
| image | PIL.Image.Image | Yes | Conditioning image loaded via load_image. Serves as the visual anchor for the generated video. |
| height | int | Yes | Height of the output video in pixels. 480 for CogVideoX-5b-I2V, 768 (or custom) for CogVideoX1.5-5b-I2V. |
| width | int | Yes | Width of the output video in pixels. 720 for CogVideoX-5b-I2V, 1360 (or custom) for CogVideoX1.5-5b-I2V. |
| num_frames | int | No | Number of video frames to generate. Default is 81 (approximately 5 seconds at 16 fps). |
| num_inference_steps | int | No | Number of denoising steps. Default is 50. More steps yield higher quality but slower generation. |
| guidance_scale | float | No | Classifier-free guidance scale. Default is 6.0. Higher values increase prompt adherence. |
| use_dynamic_cfg | bool | No | Whether to use dynamic classifier-free guidance. Default is True when using the DPM scheduler. |
| num_videos_per_prompt | int | No | Number of videos to generate per prompt. Default is 1. |
| generator | torch.Generator | No | Random number generator for reproducibility. Use torch.Generator().manual_seed(seed). |
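Before invoking the pipeline, the spatial and temporal parameters can be sanity-checked. The sketch below assumes the CogVideoX VAE compresses time by a factor of 4 (so num_frames should be 4k + 1, matching the documented 81) and that height/width should be divisible by 16; verify both factors against the loaded model's config before relying on them:

```python
def validate_video_params(height, width, num_frames,
                          temporal_compression=4, spatial_factor=16):
    """Collect human-readable errors for invalid generation parameters.

    Assumptions (check against the model config): the VAE compresses
    time 4x, so num_frames must be temporal_compression * k + 1; and
    spatial dims must be divisible by 16.
    """
    errors = []
    if (num_frames - 1) % temporal_compression != 0:
        errors.append(
            f"num_frames must be {temporal_compression}k+1, got {num_frames}")
    if height % spatial_factor != 0:
        errors.append(f"height must be divisible by {spatial_factor}, got {height}")
    if width % spatial_factor != 0:
        errors.append(f"width must be divisible by {spatial_factor}, got {width}")
    return errors
```

Under these assumptions, the documented presets (480x720 and 768x1360 at 81 frames) validate cleanly.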
Outputs
| Output | Type | Description |
|---|---|---|
| Pipeline output | CogVideoXPipelineOutput | Output object containing generated video frames. Access via output.frames[0] to get a List[PIL.Image.Image] representing the video frames. |
Usage Examples
Basic I2V Generation
import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import load_image
# Load and configure pipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
"THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
# Load conditioning image
image = load_image("/path/to/reference.png")
# Generate video
video_frames = pipe(
prompt="A cat walking across a sunny garden",
image=image,
height=480,
width=720,
num_frames=81,
num_inference_steps=50,
guidance_scale=6.0,
use_dynamic_cfg=True,
num_videos_per_prompt=1,
generator=torch.Generator().manual_seed(42),
).frames[0]
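The frames returned above can be written to an mp4 with diffusers.utils.export_to_video(video_frames, "output.mp4", fps=16), or persisted individually as images. A minimal sketch of the latter, assuming video_frames is the List[PIL.Image.Image] from the call above (the helper name and output layout are this doc's invention):

```python
import os

def save_frames(video_frames, out_dir="frames"):
    """Save each generated frame as a numbered PNG; returns the paths.

    Assumes each element of video_frames exposes a PIL-style
    .save(path) method.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, frame in enumerate(video_frames):
        path = os.path.join(out_dir, f"frame_{i:04d}.png")
        frame.save(path)
        paths.append(path)
    return paths
```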
I2V Generation with CogVideoX1.5 at Custom Resolution
import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import load_image
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
"THUDM/CogVideoX1.5-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
image = load_image("/path/to/reference.png")
# Use custom resolution with CogVideoX1.5
video_frames = pipe(
prompt="A bird taking flight from a tree branch",
image=image,
height=768,
width=1360,
num_frames=81,
num_inference_steps=50,
guidance_scale=6.0,
use_dynamic_cfg=True,
generator=torch.Generator().manual_seed(42),
).frames[0]
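With use_dynamic_cfg=True, the pipeline ramps the guidance scale over the denoising trajectory rather than holding it fixed: guidance starts near 1 at high-noise timesteps and approaches 1 + guidance_scale near the end. The sketch below is a cosine ramp of this kind; the exact exponent and formula are assumptions, so consult the diffusers CogVideoX pipeline source for the authoritative schedule:

```python
import math

def dynamic_cfg(t, num_inference_steps, guidance_scale=6.0, power=5.0):
    """Cosine ramp of the CFG scale over denoising (a sketch, not the
    verbatim diffusers formula). t counts down from num_inference_steps
    (high noise) to 0 (clean); the returned scale grows from 1 toward
    1 + guidance_scale as t decreases."""
    progress = (num_inference_steps - t) / num_inference_steps
    return 1 + guidance_scale * (1 - math.cos(math.pi * progress ** power)) / 2
```

The high power keeps guidance weak for most of the trajectory and concentrates the ramp in the final, low-noise steps.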