
Implementation:Zai org CogVideo CogVideoXI2VPipeline Call

From Leeroopedia


Metadata

| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |

Overview

Concrete tool for generating image-conditioned video by calling the CogVideoX I2V pipeline provided by the diffusers library.

Description

The I2V pipeline is called as a function (via __call__) with the text prompt, conditioning image, and generation parameters. Internally, the pipeline:

  1. Encodes the text prompt using the tokenizer and text encoder.
  2. Encodes the conditioning image using the VAE encoder.
  3. Initializes random noise latents for the video.
  4. Concatenates the image latent with the noise latent along the channel dimension.
  5. Iteratively denoises the combined latent using the transformer and scheduler for the specified number of inference steps.
  6. Decodes the final denoised latent using the VAE decoder to produce pixel-space video frames.

The output is a CogVideoXPipelineOutput object whose frames attribute contains a list of generated videos, where each video is a list of PIL Image objects (one per frame). Access the first (and typically only) video via output.frames[0].
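The channel-dimension concatenation in step 4 can be sketched with plain arrays. The latent sizes below are illustrative assumptions (8x spatial and 4x temporal compression, 16 latent channels), not values read from the repo:

```python
import numpy as np

# Assumed shapes: an 81-frame 480x720 video with 8x spatial / 4x temporal
# compression and 16 latent channels maps to a (1, 21, 16, 60, 90) latent.
B, T, C, H, W = 1, 21, 16, 60, 90
noise_latents = np.zeros((B, T, C, H, W))  # step 3: random noise latents
image_latents = np.zeros((B, T, C, H, W))  # step 2: encoded image, padded over T

# Step 4: concatenate along the channel axis, doubling the transformer's
# input channels from C to 2*C.
combined = np.concatenate([noise_latents, image_latents], axis=2)
print(combined.shape)  # (1, 21, 32, 60, 90)
```

The transformer in step 5 then consumes this 2*C-channel latent, so the image condition is visible at every denoising step.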

Usage

Call the configured I2V pipeline with a prompt, image, and generation parameters. The pipeline must be loaded and configured (scheduler and memory optimizations) before calling.

Code Reference

Source Location

inference/cli_demo.py, lines 157-169.

Signature

output = pipe(
    prompt,                                          # str: text description of desired video
    image=image,                                     # PIL.Image.Image: conditioning image
    height=height,                                   # int: 480 for 5B-I2V, custom for 1.5-5B-I2V
    width=width,                                     # int: 720 for 5B-I2V, custom for 1.5-5B-I2V
    num_frames=num_frames,                           # int: number of frames (default 81)
    num_inference_steps=num_inference_steps,         # int: denoising steps (default 50)
    guidance_scale=guidance_scale,                   # float: CFG scale (default 6.0)
    use_dynamic_cfg=True,                            # bool: dynamic guidance for DPM scheduler
    num_videos_per_prompt=num_videos_per_prompt,     # int: videos per prompt (default 1)
    generator=torch.Generator().manual_seed(seed),   # torch.Generator: for reproducibility
)
# Returns: CogVideoXPipelineOutput
# Access frames: output.frames[0] -> List[PIL.Image.Image]
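use_dynamic_cfg varies the guidance strength over the denoising run rather than holding it fixed at guidance_scale. The exact schedule lives inside the diffusers pipeline; the cosine ramp below is a hypothetical sketch of the general shape, not the library's formula:

```python
import math

def dynamic_cfg(step: int, num_steps: int, guidance_scale: float) -> float:
    """Illustrative cosine ramp (assumed shape): guidance starts near 1
    early in denoising and reaches the full scale at the final step."""
    progress = (step + 1) / num_steps  # 0 -> 1 over the run
    return 1.0 + (guidance_scale - 1.0) * (1.0 - math.cos(math.pi * progress)) / 2.0

print(round(dynamic_cfg(0, 50, 6.0), 3))   # ~1.005 at the first step
print(round(dynamic_cfg(49, 50, 6.0), 3))  # 6.0 at the final step
```

With a fixed scale, every step pushes equally hard toward the prompt; a ramp like this applies weak guidance while coarse structure forms and strong guidance as details resolve.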

Import

from diffusers import CogVideoXImageToVideoPipeline

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | Text description of the desired video content and motion. |
| image | PIL.Image.Image | Yes | Conditioning image loaded via load_image. Serves as the visual anchor for the generated video. |
| height | int | Yes | Height of the output video in pixels: 480 for CogVideoX-5b-I2V, 768 (or custom) for CogVideoX1.5-5b-I2V. |
| width | int | Yes | Width of the output video in pixels: 720 for CogVideoX-5b-I2V, 1360 (or custom) for CogVideoX1.5-5b-I2V. |
| num_frames | int | No | Number of video frames to generate. Default is 81 (approximately 5 seconds at 16 fps). |
| num_inference_steps | int | No | Number of denoising steps. Default is 50. More steps yield higher quality but slower generation. |
| guidance_scale | float | No | Classifier-free guidance scale. Default is 6.0. Higher values increase prompt adherence. |
| use_dynamic_cfg | bool | No | Whether to use dynamic classifier-free guidance. Default is True when using the DPM scheduler. |
| num_videos_per_prompt | int | No | Number of videos to generate per prompt. Default is 1. |
| generator | torch.Generator | No | Random number generator for reproducibility. Use torch.Generator().manual_seed(seed). |
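The num_frames default implies the clip length by simple arithmetic (81 frames at 16 fps, per the table above):

```python
def clip_duration_seconds(num_frames: int, fps: float = 16.0) -> float:
    # 81 frames at 16 fps -> roughly 5 seconds, matching the table's note.
    return num_frames / fps

print(clip_duration_seconds(81))  # 5.0625
```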

Outputs

| Output | Type | Description |
|---|---|---|
| Pipeline output | CogVideoXPipelineOutput | Output object containing the generated video frames. Access output.frames[0] to get a List[PIL.Image.Image] representing the video frames. |
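The nested frames layout can be mirrored with a hypothetical stand-in (FakePipelineOutput below is for illustration only; the real object is diffusers' CogVideoXPipelineOutput, and the inner items are PIL images):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FakePipelineOutput:
    # frames is a list of videos; each video is a list of per-frame images.
    frames: List[List[object]]

output = FakePipelineOutput(frames=[["frame0", "frame1", "frame2"]])
video = output.frames[0]  # first (and typically only) video
print(len(video))  # 3
```

This is why the examples below end with `.frames[0]`: indexing with [0] selects the first video, leaving a flat list of frames.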

Usage Examples

Basic I2V Generation

import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import load_image

# Load and configure pipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Load conditioning image
image = load_image("/path/to/reference.png")

# Generate video
video_frames = pipe(
    prompt="A cat walking across a sunny garden",
    image=image,
    height=480,
    width=720,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    num_videos_per_prompt=1,
    generator=torch.Generator().manual_seed(42),
).frames[0]

I2V Generation with CogVideoX1.5 at Custom Resolution

import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = load_image("/path/to/reference.png")

# Use custom resolution with CogVideoX1.5
video_frames = pipe(
    prompt="A bird taking flight from a tree branch",
    image=image,
    height=768,
    width=1360,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator().manual_seed(42),
).frames[0]
