Principle: CogVideo I2V Pipeline Loading
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for loading a complete image-to-video diffusion pipeline that generates video conditioned on both a text prompt and an input image.
Description
Image-to-video pipeline loading extends text-to-video by adding image conditioning. The pipeline loads the same components (tokenizer, text encoder, transformer, VAE, scheduler) but uses an I2V-specific transformer variant that accepts image latents as additional conditioning. This enables generating videos that start from or are consistent with a reference image.
Component Architecture
The I2V pipeline loads five core components from pretrained weights:
- Tokenizer: Converts text prompts into token sequences for the text encoder.
- Text Encoder: Transforms token sequences into dense text embeddings that condition the denoising process.
- Transformer: The I2V-specific variant of the CogVideoX transformer, which accepts both text embeddings and image latents as conditioning inputs. This transformer has double the input channels (32 vs 16) compared to the text-to-video variant, to accommodate the concatenated image latent.
- VAE (Variational Autoencoder): Encodes images into latent space and decodes denoised latents back into pixel space for video frames.
- Scheduler: Controls the noise schedule and denoising step progression during inference.
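The five components above can also be loaded individually rather than through the pipeline class. The sketch below assumes the Hugging Face `diffusers` and `transformers` packages are installed; the component classes and `subfolder` names follow the diffusers CogVideoX integration.

```python
# Sketch: loading the five I2V pipeline components one by one from a
# pretrained checkpoint. Class names follow the diffusers CogVideoX
# integration; the checkpoint id is the public I2V repo.
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import (
    AutoencoderKLCogVideoX,
    CogVideoXTransformer3DModel,
    CogVideoXDPMScheduler,
)

repo = "THUDM/CogVideoX-5b-I2V"

tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(repo, subfolder="text_encoder")
# I2V transformer variant: input projection widened to accept the
# concatenated image latent (32 channels instead of 16).
transformer = CogVideoXTransformer3DModel.from_pretrained(
    repo, subfolder="transformer", torch_dtype=torch.bfloat16
)
vae = AutoencoderKLCogVideoX.from_pretrained(repo, subfolder="vae")
scheduler = CogVideoXDPMScheduler.from_pretrained(repo, subfolder="scheduler")
```

Loading components separately is mainly useful when swapping one of them out (e.g. a different scheduler) before assembling the pipeline.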
Model Variants
Two I2V model variants are available:
- CogVideoX-5b-I2V: Generates video at 480×720 resolution with 5 billion parameters.
- CogVideoX1.5-5b-I2V: Generates video at 768×1360 resolution (or custom resolution) with 5 billion parameters and an updated architecture.
Usage
Use when generating videos from an input image plus text description. Requires an I2V-specific model variant (e.g., THUDM/CogVideoX-5b-I2V or THUDM/CogVideoX1.5-5b-I2V). The I2V pipeline is loaded identically to the T2V pipeline but with the CogVideoXImageToVideoPipeline class and an I2V model checkpoint.
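A minimal end-to-end sketch, assuming `diffusers` is installed and a CUDA GPU is available. The prompt, input filename, and argument values are illustrative, not tuned settings.

```python
# Sketch: load the I2V pipeline and generate a clip from an image plus a
# text prompt. Checkpoint id and pipeline class are from the public
# diffusers CogVideoX integration; other values are illustrative.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("input.jpg")  # reference image (e.g. intended first frame)
video = pipe(
    prompt="A sailboat drifts across a calm bay at sunset.",
    image=image,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

Compared with T2V, the only user-facing differences are the pipeline class, the I2V checkpoint, and the extra `image` argument.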
Theoretical Basis
Image Conditioning via Channel Concatenation
Image conditioning works by concatenating the encoded image latent with the noisy video latent along the channel dimension. The conditioning image is first encoded by the VAE encoder into a latent representation z_img. During each denoising step, this image latent is concatenated with the noisy video latent z_t along the channel axis:
input = concat(z_t, z_img, dim=channels)
This doubles the number of input channels from 16 (text-to-video) to 32 (image-to-video). The I2V transformer variant has its input projection layer modified to accept 32 channels rather than 16.
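The concatenation step can be sketched with NumPy arrays standing in for latents. The `[batch, channels, frames, height, width]` layout matches CogVideoX latents, but the specific sizes below are illustrative only.

```python
# Sketch of I2V channel concatenation using NumPy stand-ins for latents.
# Layout: [batch, channels, frames, height, width]; sizes are illustrative.
import numpy as np

B, C, F, H, W = 1, 16, 13, 60, 90  # 16 latent channels each

z_t = np.random.randn(B, C, F, H, W)    # noisy video latent at step t
z_img = np.random.randn(B, C, F, H, W)  # VAE-encoded image latent,
                                        # repeated/padded across frames

# Concatenate along the channel axis (axis=1): 16 + 16 = 32 channels,
# matching the I2V transformer's widened input projection.
model_input = np.concatenate([z_t, z_img], axis=1)
print(model_input.shape)  # (1, 32, 13, 60, 90)
```

The transformer's input projection then maps these 32 channels back to its hidden width, so the rest of the network is unchanged relative to the T2V variant.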
Consistency Preservation
The transformer learns to denoise while preserving visual consistency with the conditioning image. During training, the model sees pairs of (image, video) where the image corresponds to the first frame or a key frame of the video. The concatenation mechanism allows the model to attend to fine-grained spatial details in the image latent at every denoising step, maintaining structural and appearance consistency throughout the generated video sequence.