Implementation: CogVideoXImageToVideoPipeline.from_pretrained (zai-org/CogVideo)
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for loading the CogVideoX image-to-video pipeline from pretrained weights provided by the diffusers library.
Description
CogVideoXImageToVideoPipeline.from_pretrained is the factory method that downloads (or loads from cache) and initializes all components of the I2V pipeline from a Hugging Face model identifier. It loads the tokenizer, text encoder, I2V-specific transformer, VAE, and scheduler into a single callable pipeline object. The I2V transformer variant has 32 input channels (double the T2V variant's 16 channels) to accommodate the concatenated image latent.
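The channel arithmetic described above can be illustrated with a trivial sketch (the channel counts are taken from this doc; verify against the transformer config of the actual checkpoint):

```python
# Illustration of the I2V transformer's input-channel count: the noisy video
# latent (16 channels, as in the T2V variant) is concatenated channel-wise
# with the encoded conditioning-image latent (another 16 channels).
T2V_LATENT_CHANNELS = 16

i2v_in_channels = T2V_LATENT_CHANNELS * 2  # video latent + image latent
print(i2v_in_channels)  # 32
```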
Two supported model identifiers correspond to different resolution and architecture variants:
- THUDM/CogVideoX-5b-I2V: Generates video at 480x720 resolution.
- THUDM/CogVideoX1.5-5b-I2V: Generates video at 768x1360 resolution (custom resolutions also supported).
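The identifier-to-resolution mapping above can be captured in a small lookup helper. This is a hypothetical convenience function, not part of diffusers or the CogVideo repo:

```python
# Hypothetical helper mapping the two supported I2V checkpoints to their
# default output resolutions (height, width), per this doc's model list.
DEFAULT_I2V_RESOLUTIONS = {
    "THUDM/CogVideoX-5b-I2V": (480, 720),
    "THUDM/CogVideoX1.5-5b-I2V": (768, 1360),
}

def default_resolution(model_path: str) -> tuple[int, int]:
    """Return the default (height, width) for a supported I2V identifier."""
    try:
        return DEFAULT_I2V_RESOLUTIONS[model_path]
    except KeyError:
        raise ValueError(f"Unsupported I2V model identifier: {model_path!r}")

print(default_resolution("THUDM/CogVideoX-5b-I2V"))  # (480, 720)
```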
Usage
Import CogVideoXImageToVideoPipeline from the diffusers library and call from_pretrained with an I2V model identifier. The returned pipeline object is ready for scheduler configuration, memory optimization, and video generation.
Code Reference
Source Location
inference/cli_demo.py, line 119.
Signature
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
model_path, # str: "THUDM/CogVideoX-5b-I2V" or "THUDM/CogVideoX1.5-5b-I2V"
torch_dtype=dtype, # torch.dtype: torch.bfloat16 (recommended)
)
# Returns: CogVideoXImageToVideoPipeline
Import
from diffusers import CogVideoXImageToVideoPipeline
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Hugging Face model identifier for the I2V variant. Supported values: "THUDM/CogVideoX-5b-I2V", "THUDM/CogVideoX1.5-5b-I2V". |
| torch_dtype | torch.dtype | No | Data type for model weights. Defaults to the model's saved dtype. torch.bfloat16 is recommended for memory efficiency and speed. |
Outputs
| Output | Type | Description |
|---|---|---|
| Pipeline instance | CogVideoXImageToVideoPipeline | A fully initialized I2V pipeline with tokenizer, text encoder, transformer, VAE, and scheduler loaded. Ready for scheduler configuration and inference. |
Supported Models and Resolutions
| Model Identifier | Default Resolution (HxW) | Notes |
|---|---|---|
| THUDM/CogVideoX-5b-I2V | 480 x 720 | Fixed resolution. |
| THUDM/CogVideoX1.5-5b-I2V | 768 x 1360 | Custom resolution supported. |
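Since CogVideoX1.5-5b-I2V accepts custom resolutions, a pre-flight dimension check can catch invalid values before loading any weights. This sketch ASSUMES the VAE downsamples spatially by 8x and the transformer uses a 2x2 patch size (so height and width must be multiples of 16); verify these factors against the actual model config before relying on it:

```python
# Sketch of a pre-flight check for custom resolutions on CogVideoX1.5-5b-I2V.
# ASSUMPTION: 8x spatial VAE compression times a 2x2 transformer patch size
# means height and width should each be divisible by 16.
SPATIAL_FACTOR = 8 * 2  # assumed VAE compression x patch size

def check_resolution(height: int, width: int) -> None:
    """Raise ValueError if a dimension is not a multiple of SPATIAL_FACTOR."""
    for name, value in (("height", height), ("width", width)):
        if value % SPATIAL_FACTOR != 0:
            raise ValueError(f"{name}={value} is not a multiple of {SPATIAL_FACTOR}")

check_resolution(768, 1360)  # the default 1.5 resolution passes
```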
Usage Examples
Loading the CogVideoX-5b-I2V Pipeline
import torch
from diffusers import CogVideoXImageToVideoPipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
"THUDM/CogVideoX-5b-I2V",
torch_dtype=torch.bfloat16,
)
Loading the CogVideoX1.5-5b-I2V Pipeline
import torch
from diffusers import CogVideoXImageToVideoPipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
"THUDM/CogVideoX1.5-5b-I2V",
torch_dtype=torch.bfloat16,
)