Implementation:Zai org CogVideo CogVideoXI2VPipeline From Pretrained


Metadata

| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |

Overview

Concrete tool for loading the CogVideoX image-to-video (I2V) pipeline from pretrained weights, using the from_pretrained factory method provided by the diffusers library.

Description

CogVideoXImageToVideoPipeline.from_pretrained is the factory method that downloads (or loads from cache) and initializes all components of the I2V pipeline from a Hugging Face model identifier. It loads the tokenizer, text encoder, I2V-specific transformer, VAE, and scheduler into a single callable pipeline object. The I2V transformer has 32 input channels, double the 16 channels of the text-to-video (T2V) variant, because the encoded image latent is concatenated with the video latent along the channel dimension.
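
As a quick sanity check of the component list and channel count described above, a loaded pipeline can be summarized as in the sketch below. The helper name `summarize_pipeline` is hypothetical and not part of diffusers or the CogVideo repo; it assumes the pipeline exposes the standard diffusers `components` mapping and that the transformer's config carries an `in_channels` entry.

```python
def summarize_pipeline(pipe):
    """Return component class names and the transformer's input channel count.

    Hypothetical helper for illustration; not part of diffusers.
    """
    # `components` maps names such as "tokenizer" or "vae" to the loaded objects.
    names = {name: type(obj).__name__ for name, obj in pipe.components.items()}
    # For an I2V checkpoint this is expected to be 32, per the description above.
    in_channels = pipe.transformer.config.in_channels
    return {"components": names, "transformer_in_channels": in_channels}
```

Calling `summarize_pipeline(pipe)` on a pipeline returned by from_pretrained gives a small dictionary that is easy to log or compare against expectations before running a long generation job.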

Two supported model identifiers correspond to different resolution and architecture variants:

  • THUDM/CogVideoX-5b-I2V: Generates video at 480x720 resolution.
  • THUDM/CogVideoX1.5-5b-I2V: Generates video at 768x1360 resolution (custom resolution also supported).

Usage

Import CogVideoXImageToVideoPipeline from the diffusers library and call from_pretrained with an I2V model identifier. The returned pipeline object is ready for scheduler configuration, memory optimization, and video generation.
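
The three follow-up steps named above (scheduler configuration, memory optimization, video generation) might be chained as in the sketch below. It follows the usual diffusers workflow and is not a copy of inference/cli_demo.py. The function name `generate_i2v`, its default arguments, and the output file name are placeholders, and the sampling settings (50 steps, guidance scale 6.0, 8 fps) are illustrative assumptions rather than recommended values.

```python
def generate_i2v(image_path, prompt, model_id="THUDM/CogVideoX-5b-I2V",
                 output_path="output.mp4"):
    """Load the I2V pipeline, apply common memory savers, and write a video.

    Illustrative sketch; names and defaults are assumptions.
    """
    # Imported lazily so this function can be defined without a GPU stack.
    import torch
    from diffusers import CogVideoXDPMScheduler, CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )

    # Scheduler configuration: rebuild a scheduler from the loaded config.
    pipe.scheduler = CogVideoXDPMScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )

    # Memory optimization: offload modules to CPU and decode the VAE piecewise.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    # Video generation: the input image conditions the generated frames.
    frames = pipe(
        image=load_image(image_path),
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=6.0,
    ).frames[0]
    export_to_video(frames, output_path, fps=8)
    return output_path
```

Sequential CPU offload trades speed for a lower peak GPU memory footprint; on a GPU with enough memory, moving the whole pipeline to the device with `pipe.to("cuda")` is the faster alternative.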

Code Reference

Source Location

inference/cli_demo.py, line 119.

Signature

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    model_path,          # str: "THUDM/CogVideoX-5b-I2V" or "THUDM/CogVideoX1.5-5b-I2V"
    torch_dtype=dtype,   # torch.dtype: torch.bfloat16 (recommended)
)
# Returns: CogVideoXImageToVideoPipeline

Import

from diffusers import CogVideoXImageToVideoPipeline

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Hugging Face model identifier for the I2V variant (a local directory containing a saved pipeline also works). Supported values: "THUDM/CogVideoX-5b-I2V", "THUDM/CogVideoX1.5-5b-I2V". |
| torch_dtype | torch.dtype | No | Data type for the model weights. If omitted, diffusers loads the weights in torch.float32. torch.bfloat16 is recommended for memory efficiency and speed. |
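
When the dtype arrives as a string from a command-line flag or a config file, a small resolver can validate it before any weights are downloaded. The sketch below is an illustration only: `resolve_dtype` and `ALLOWED_DTYPES` are hypothetical names, not part of diffusers, and the allowed set is an assumption.

```python
ALLOWED_DTYPES = ("bfloat16", "float16", "float32")  # assumed set for this sketch


def resolve_dtype(name):
    """Map a dtype name to the matching torch.dtype, rejecting unknown names."""
    if name not in ALLOWED_DTYPES:
        raise ValueError(
            f"unsupported dtype {name!r}; expected one of {ALLOWED_DTYPES}"
        )
    import torch  # imported lazily; only needed once the name is validated
    return getattr(torch, name)
```

The result can be passed directly as `torch_dtype=resolve_dtype("bfloat16")` in the from_pretrained call.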

Outputs

| Output | Type | Description |
|---|---|---|
| Pipeline instance | CogVideoXImageToVideoPipeline | A fully initialized I2V pipeline with tokenizer, text encoder, transformer, VAE, and scheduler loaded. Ready for scheduler configuration and inference. |

Supported Models and Resolutions

| Model Identifier | Default Resolution (H x W) | Notes |
|---|---|---|
| THUDM/CogVideoX-5b-I2V | 480 x 720 | Fixed resolution. |
| THUDM/CogVideoX1.5-5b-I2V | 768 x 1360 | Custom resolution supported. |
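
Because the default resolution depends on the checkpoint, a lookup keyed on the model identifier can supply height and width at generation time. The sketch below encodes only the two rows of the table above; the names `DEFAULT_RESOLUTION` and `default_resolution` are hypothetical.

```python
# (height, width) per supported I2V model identifier, taken from the table above.
DEFAULT_RESOLUTION = {
    "THUDM/CogVideoX-5b-I2V": (480, 720),
    "THUDM/CogVideoX1.5-5b-I2V": (768, 1360),
}


def default_resolution(model_id):
    """Return (height, width) for a supported I2V model identifier."""
    try:
        return DEFAULT_RESOLUTION[model_id]
    except KeyError:
        raise ValueError(f"unknown I2V model identifier: {model_id!r}") from None
```

A caller would unpack the result, for example `height, width = default_resolution(model_id)`, and pass both values to the pipeline call as `height=height, width=width`.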

Usage Examples

Loading the CogVideoX-5b-I2V Pipeline

import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16,
)

Loading the CogVideoX1.5-5b-I2V Pipeline

import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5b-I2V",
    torch_dtype=torch.bfloat16,
)
