
Implementation:Zai org CogVideo CogVideoXI2VPipeline Call

From Leeroopedia


Metadata

| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |

Overview

Concrete tool for generating image-conditioned video by calling the CogVideoX I2V pipeline provided by the diffusers library.

Description

The I2V pipeline is called as a function (via __call__) with the text prompt, conditioning image, and generation parameters. Internally, the pipeline:

  1. Encodes the text prompt using the tokenizer and text encoder.
  2. Encodes the conditioning image using the VAE encoder.
  3. Initializes random noise latents for the video.
  4. Concatenates the image latent with the noise latent along the channel dimension.
  5. Iteratively denoises the combined latent using the transformer and scheduler for the specified number of inference steps.
  6. Decodes the final denoised latent using the VAE decoder to produce pixel-space video frames.

The output is a CogVideoXPipelineOutput object whose frames attribute contains a list of generated videos, where each video is a list of PIL Image objects (one per frame). Access the first (and typically only) video via output.frames[0].
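The channel-dimension concatenation in step 4 can be sketched with plain arrays. The latent sizes below are illustrative assumptions (8x spatial and 4x temporal compression, 16 latent channels), not values read from the repo:

```python
import numpy as np

# Assumed shapes: an 81-frame 480x720 video with 8x spatial / 4x temporal
# compression and 16 latent channels maps to a (1, 21, 16, 60, 90) latent.
B, T, C, H, W = 1, 21, 16, 60, 90
noise_latents = np.zeros((B, T, C, H, W))  # step 3: random noise latents
image_latents = np.zeros((B, T, C, H, W))  # step 2: encoded image, padded over T

# Step 4: concatenate along the channel axis, doubling the transformer's
# input channels from C to 2*C.
combined = np.concatenate([noise_latents, image_latents], axis=2)
print(combined.shape)  # (1, 21, 32, 60, 90)
```

The transformer in step 5 then consumes this 2*C-channel latent, so the image condition is visible at every denoising step.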

Usage

Call the configured I2V pipeline with a prompt, image, and generation parameters. The pipeline must be loaded and configured (scheduler and memory optimizations) before calling.

Code Reference

Source Location

inference/cli_demo.py, lines 157-169.

Signature

output = pipe(
    prompt,                                          # str: text description of desired video
    image=image,                                     # PIL.Image.Image: conditioning image
    height=height,                                   # int: 480 for 5B-I2V, custom for 1.5-5B-I2V
    width=width,                                     # int: 720 for 5B-I2V, custom for 1.5-5B-I2V
    num_frames=num_frames,                           # int: number of frames (default 81)
    num_inference_steps=num_inference_steps,         # int: denoising steps (default 50)
    guidance_scale=guidance_scale,                   # float: CFG scale (default 6.0)
    use_dynamic_cfg=True,                            # bool: dynamic guidance for DPM scheduler
    num_videos_per_prompt=num_videos_per_prompt,     # int: videos per prompt (default 1)
    generator=torch.Generator().manual_seed(seed),   # torch.Generator: for reproducibility
)
# Returns: CogVideoXPipelineOutput
# Access frames: output.frames[0] -> List[PIL.Image.Image]
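use_dynamic_cfg varies the guidance strength over the denoising run rather than holding it fixed at guidance_scale. The exact schedule lives inside the diffusers pipeline; the cosine ramp below is a hypothetical sketch of the general shape, not the library's formula:

```python
import math

def dynamic_cfg(step: int, num_steps: int, guidance_scale: float) -> float:
    """Illustrative cosine ramp (assumed shape): guidance starts near 1
    early in denoising and reaches the full scale at the final step."""
    progress = (step + 1) / num_steps  # 0 -> 1 over the run
    return 1.0 + (guidance_scale - 1.0) * (1.0 - math.cos(math.pi * progress)) / 2.0

print(round(dynamic_cfg(0, 50, 6.0), 3))   # ~1.005 at the first step
print(round(dynamic_cfg(49, 50, 6.0), 3))  # 6.0 at the final step
```

With a fixed scale, every step pushes equally hard toward the prompt; a ramp like this applies weak guidance while coarse structure forms and strong guidance as details resolve.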

Import

from diffusers import CogVideoXImageToVideoPipeline

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | Text description of the desired video content and motion. |
| image | PIL.Image.Image | Yes | Conditioning image loaded via load_image. Serves as the visual anchor for the generated video. |
| height | int | Yes | Height of the output video in pixels: 480 for CogVideoX-5b-I2V, 768 (or custom) for CogVideoX1.5-5b-I2V. |
| width | int | Yes | Width of the output video in pixels: 720 for CogVideoX-5b-I2V, 1360 (or custom) for CogVideoX1.5-5b-I2V. |
| num_frames | int | No | Number of video frames to generate. Default is 81 (approximately 5 seconds at 16 fps). |
| num_inference_steps | int | No | Number of denoising steps. Default is 50. More steps yield higher quality but slower generation. |
| guidance_scale | float | No | Classifier-free guidance scale. Default is 6.0. Higher values increase prompt adherence. |
| use_dynamic_cfg | bool | No | Whether to use dynamic classifier-free guidance. Default is True when using the DPM scheduler. |
| num_videos_per_prompt | int | No | Number of videos to generate per prompt. Default is 1. |
| generator | torch.Generator | No | Random number generator for reproducibility. Use torch.Generator().manual_seed(seed). |
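The num_frames default implies the clip length by simple arithmetic (81 frames at 16 fps, per the table above):

```python
def clip_duration_seconds(num_frames: int, fps: float = 16.0) -> float:
    # 81 frames at 16 fps -> roughly 5 seconds, matching the table's note.
    return num_frames / fps

print(clip_duration_seconds(81))  # 5.0625
```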

Outputs

| Output | Type | Description |
|---|---|---|
| Pipeline output | CogVideoXPipelineOutput | Output object containing the generated video frames. Access output.frames[0] to get a List[PIL.Image.Image] representing the video frames. |
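The nested frames layout can be mirrored with a hypothetical stand-in (FakePipelineOutput below is for illustration only; the real object is diffusers' CogVideoXPipelineOutput, and the inner items are PIL images):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FakePipelineOutput:
    # frames is a list of videos; each video is a list of per-frame images.
    frames: List[List[object]]

output = FakePipelineOutput(frames=[["frame0", "frame1", "frame2"]])
video = output.frames[0]  # first (and typically only) video
print(len(video))  # 3
```

This is why the examples below end with `.frames[0]`: indexing with [0] selects the first video, leaving a flat list of frames.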

Usage Examples

Basic I2V Generation

import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import load_image

# Load and configure pipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Load conditioning image
image = load_image("/path/to/reference.png")

# Generate video
video_frames = pipe(
    prompt="A cat walking across a sunny garden",
    image=image,
    height=480,
    width=720,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    num_videos_per_prompt=1,
    generator=torch.Generator().manual_seed(42),
).frames[0]

I2V Generation with CogVideoX1.5 at Custom Resolution

import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = load_image("/path/to/reference.png")

# Use custom resolution with CogVideoX1.5
video_frames = pipe(
    prompt="A bird taking flight from a tree branch",
    image=image,
    height=768,
    width=1360,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator().manual_seed(42),
).frames[0]
