
Principle:Zai org CogVideo I2V Scheduler and Memory Config

From Leeroopedia


Metadata

Field Value
Page Type Principle
Knowledge Sources Repo (CogVideo), Paper (CogVideoX)
Domains Video_Generation, Diffusion_Models, Image_Conditioning
Last Updated 2026-02-10 00:00 GMT

Overview

Technique for configuring the noise scheduler and memory optimization settings for image-to-video generation pipelines.

Description

I2V pipeline configuration combines scheduler selection with memory optimization in a single setup step. The DPM scheduler with trailing timestep spacing replaces the pipeline's default scheduler for improved sample quality, and CPU offloading plus VAE slicing/tiling are enabled for memory efficiency. The configuration is identical to the T2V setup, applied here to the I2V pipeline.

Scheduler Configuration

The CogVideoXDPMScheduler is set with timestep_spacing="trailing", replacing the default scheduler loaded with the pipeline. The DPM-Solver scheduler provides faster convergence than DDIM for the CogVideoX-5B model family, requiring fewer inference steps to achieve comparable quality. The "trailing" timestep spacing places timesteps at the end of each interval, which has been shown to improve sample quality for diffusion models.
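The scheduler swap can be sketched as follows with the diffusers library; the model ID `THUDM/CogVideoX-5b-I2V` is the published I2V checkpoint, and all other schedule parameters are inherited from the pipeline's own config:

```python
# Sketch: load the I2V pipeline and swap in the DPM scheduler with
# trailing timestep spacing (assumes diffusers with CogVideoX support).
import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

# Rebuild the scheduler from the pipeline's existing config, overriding
# only the timestep spacing; everything else is kept as loaded.
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
```
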

Memory Optimization

Three memory optimization techniques are applied:

  • Sequential CPU Offload: Moves model components (text encoder, transformer, VAE) to CPU when not actively in use during inference, keeping only the currently executing component on GPU. This dramatically reduces peak GPU memory usage at the cost of increased inference time due to CPU-GPU data transfers.
  • VAE Slicing: Processes VAE encoding and decoding in slices along the batch dimension rather than all at once, reducing peak memory consumption during the VAE pass.
  • VAE Tiling: Processes the VAE encoding and decoding in spatial tiles rather than the full resolution at once, further reducing peak memory for high-resolution video frames.
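The three optimizations above map to three one-line calls on the loaded pipeline (a configuration sketch; `pipe` is assumed to be the I2V pipeline from the loading step):

```python
# Sketch: enable the three memory optimizations on a loaded pipeline.
pipe.enable_sequential_cpu_offload()  # one component on GPU at a time
pipe.vae.enable_slicing()             # split VAE work along the batch dim
pipe.vae.enable_tiling()              # split VAE work into spatial tiles
```

Note that `enable_sequential_cpu_offload()` manages device placement itself, so it should not be combined with a manual `pipe.to("cuda")`.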

Usage

Use after loading the I2V pipeline and before generating videos. Apply the same scheduler and memory settings as the T2V workflow. These settings are especially important for single-GPU setups with limited VRAM (e.g., 24 GB consumer GPUs).

For multi-GPU setups or GPUs with ample memory (e.g., H100 with 80 GB), CPU offloading can be replaced with pipe.to("cuda") for faster inference.
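On a high-memory GPU the offloading calls are simply omitted (a sketch of the alternative setup; `pipe` is again the loaded I2V pipeline):

```python
# Sketch: with ample VRAM, keep the whole pipeline resident on the GPU
# instead of enabling sequential CPU offload.
pipe.to("cuda")
# VAE slicing/tiling may still be enabled if decoding long, high-resolution
# clips, but they are optional when memory is not a constraint.
```
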

Theoretical Basis

DPM-Solver for Efficient Sampling

DPM-Solver is a family of dedicated high-order solvers for diffusion probabilistic models (DPMs). Unlike general-purpose ODE solvers, DPM-Solver exploits the semi-linear structure of the diffusion ODE to achieve faster convergence. The DPM-Solver++ variant used in CogVideoX provides high-quality samples in 20-50 steps, compared to 100+ steps required by naive DDPM sampling.

Trailing Timestep Spacing

In trailing timestep spacing, the discrete timesteps are placed at the trailing edge of uniformly divided intervals in the noise schedule. For a schedule with T total timesteps and N inference steps, the timesteps are computed as:

t_i = round(T - (T / N) * i) - 1 for i = 0, 1, ..., N-1

This spacing guarantees that the first inference timestep is T - 1, the highest-noise step in the schedule, providing better coverage of the high-noise regime at the start of the denoising process, which is critical for establishing global structure in the generated video.
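The trailing schedule can be reproduced in a few lines of NumPy (a minimal sketch of the formula above, not the library's own implementation):

```python
import numpy as np

def trailing_timesteps(T: int, N: int) -> np.ndarray:
    """Trailing spacing: t_i = round(T - (T/N)*i) - 1, descending from T-1."""
    return (np.arange(T, 0, -T / N).round().astype(np.int64) - 1)[:N]

# For T=1000 training timesteps and N=10 inference steps, the schedule
# starts at the highest-noise timestep 999 and steps down by 100.
print(trailing_timesteps(1000, 10))
```
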

CPU Offloading for Memory Management

Sequential CPU offloading keeps only one model component on the GPU at any time, moving each component back to the CPU as soon as its stage completes. The execution order is:

  1. Text encoder processes the prompt on GPU, then moves to CPU.
  2. Transformer performs iterative denoising on GPU, then moves to CPU.
  3. VAE decodes latents to pixels on GPU, then moves to CPU.

This reduces peak GPU memory from the sum of all components to the size of the largest single component.
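The arithmetic behind that claim is max-versus-sum; the component sizes below are made-up illustrative numbers, not measured CogVideoX footprints:

```python
# Toy illustration: offloading reduces peak GPU memory from the sum of
# all component sizes to the size of the largest single component.
components_gb = {"text_encoder": 9.0, "transformer": 11.0, "vae": 0.5}

all_resident_peak = sum(components_gb.values())  # everything on GPU
offloaded_peak = max(components_gb.values())     # one component at a time

print(f"resident: {all_resident_peak:.1f} GB, offloaded: {offloaded_peak:.1f} GB")
```
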
