Implementation: CogVideoXImageToVideoPipeline.from_pretrained (zai-org/CogVideo)
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Wrapper Doc) |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for loading the CogVideoX image-to-video pipeline from pretrained weights provided by the diffusers library.
Description
CogVideoXImageToVideoPipeline.from_pretrained is the factory method that downloads (or loads from cache) and initializes all components of the I2V pipeline from a Hugging Face model identifier. It loads the tokenizer, text encoder, I2V-specific transformer, VAE, and scheduler into a single callable pipeline object. The I2V transformer variant has 32 input channels (double the T2V variant's 16 channels) to accommodate the concatenated image latent.
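The channel arithmetic described above can be illustrated with a trivial sketch (the channel counts are taken from this doc; verify against the transformer config of the actual checkpoint):

```python
# Illustration of the I2V transformer's input-channel count: the noisy video
# latent (16 channels, as in the T2V variant) is concatenated channel-wise
# with the encoded conditioning-image latent (another 16 channels).
T2V_LATENT_CHANNELS = 16

i2v_in_channels = T2V_LATENT_CHANNELS * 2  # video latent + image latent
print(i2v_in_channels)  # 32
```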
Two supported model identifiers correspond to different resolution and architecture variants:
- THUDM/CogVideoX-5b-I2V: Generates video at 480x720 resolution.
- THUDM/CogVideoX1.5-5b-I2V: Generates video at 768x1360 resolution (custom resolutions also supported).
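The identifier-to-resolution mapping above can be captured in a small lookup helper. This is a hypothetical convenience function, not part of diffusers or the CogVideo repo:

```python
# Hypothetical helper mapping the two supported I2V checkpoints to their
# default output resolutions (height, width), per this doc's model list.
DEFAULT_I2V_RESOLUTIONS = {
    "THUDM/CogVideoX-5b-I2V": (480, 720),
    "THUDM/CogVideoX1.5-5b-I2V": (768, 1360),
}

def default_resolution(model_path: str) -> tuple[int, int]:
    """Return the default (height, width) for a supported I2V identifier."""
    try:
        return DEFAULT_I2V_RESOLUTIONS[model_path]
    except KeyError:
        raise ValueError(f"Unsupported I2V model identifier: {model_path!r}")

print(default_resolution("THUDM/CogVideoX-5b-I2V"))  # (480, 720)
```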
Usage
Import CogVideoXImageToVideoPipeline from the diffusers library and call from_pretrained with an I2V model identifier. The returned pipeline object is ready for scheduler configuration, memory optimization, and video generation.
Code Reference
Source Location
inference/cli_demo.py, line 119.
Signature
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
model_path, # str: "THUDM/CogVideoX-5b-I2V" or "THUDM/CogVideoX1.5-5b-I2V"
torch_dtype=dtype, # torch.dtype: torch.bfloat16 (recommended)
)
# Returns: CogVideoXImageToVideoPipeline
Import
from diffusers import CogVideoXImageToVideoPipeline
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Hugging Face model identifier for the I2V variant. Supported values: "THUDM/CogVideoX-5b-I2V", "THUDM/CogVideoX1.5-5b-I2V". |
| torch_dtype | torch.dtype | No | Data type for model weights. Defaults to the model's saved dtype. torch.bfloat16 is recommended for memory efficiency and speed. |
Outputs
| Output | Type | Description |
|---|---|---|
| Pipeline instance | CogVideoXImageToVideoPipeline | A fully initialized I2V pipeline with tokenizer, text encoder, transformer, VAE, and scheduler loaded. Ready for scheduler configuration and inference. |
Supported Models and Resolutions
| Model Identifier | Default Resolution (HxW) | Notes |
|---|---|---|
| THUDM/CogVideoX-5b-I2V | 480 x 720 | Fixed resolution. |
| THUDM/CogVideoX1.5-5b-I2V | 768 x 1360 | Custom resolution supported. |
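Since CogVideoX1.5-5b-I2V accepts custom resolutions, a pre-flight dimension check can catch invalid values before loading any weights. This sketch ASSUMES the VAE downsamples spatially by 8x and the transformer uses a 2x2 patch size (so height and width must be multiples of 16); verify these factors against the actual model config before relying on it:

```python
# Sketch of a pre-flight check for custom resolutions on CogVideoX1.5-5b-I2V.
# ASSUMPTION: 8x spatial VAE compression times a 2x2 transformer patch size
# means height and width should each be divisible by 16.
SPATIAL_FACTOR = 8 * 2  # assumed VAE compression x patch size

def check_resolution(height: int, width: int) -> None:
    """Raise ValueError if a dimension is not a multiple of SPATIAL_FACTOR."""
    for name, value in (("height", height), ("width", width)):
        if value % SPATIAL_FACTOR != 0:
            raise ValueError(f"{name}={value} is not a multiple of {SPATIAL_FACTOR}")

check_resolution(768, 1360)  # the default 1.5 resolution passes
```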
Usage Examples
Loading the CogVideoX-5b-I2V Pipeline
import torch
from diffusers import CogVideoXImageToVideoPipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
"THUDM/CogVideoX-5b-I2V",
torch_dtype=torch.bfloat16,
)
Loading the CogVideoX1.5-5b-I2V Pipeline
import torch
from diffusers import CogVideoXImageToVideoPipeline
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
"THUDM/CogVideoX1.5-5b-I2V",
torch_dtype=torch.bfloat16,
)