Implementation: zai-org/CogVideo CogVideoXPipeline Call
Overview
Concrete tool for text-to-video generation using the CogVideoXPipeline callable provided by the diffusers library. This is the core generation step: it transforms a text prompt into video frames through iterative denoising.
Source
inference/cli_demo.py:L170-181
Signature
output = pipe(
prompt: str,
height: int, # 480 for 5B, 768 for 1.5-5B
width: int, # 720 for 5B, 1360 for 1.5-5B
num_frames: int = 81,
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
use_dynamic_cfg: bool = True,
num_videos_per_prompt: int = 1,
generator: torch.Generator = torch.Generator().manual_seed(42),
) -> CogVideoXPipelineOutput
# Access frames: output.frames[0] -> List[PIL.Image.Image]
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | (required) | Text description of the desired video content |
| height | int | (required) | Output video height in pixels. 480 for CogVideoX-5b, 768 for CogVideoX1.5-5B |
| width | int | (required) | Output video width in pixels. 720 for CogVideoX-5b, 1360 for CogVideoX1.5-5B |
| num_frames | int | 81 | Number of frames to generate. 49 for CogVideoX-1.0, 81 for CogVideoX-1.5 |
| num_inference_steps | int | 50 | Number of denoising steps. Higher values produce better quality but take longer |
| guidance_scale | float | 6.0 | Classifier-free guidance scale. Higher values produce stronger prompt adherence |
| use_dynamic_cfg | bool | True | Whether to vary the guidance scale dynamically during sampling (see the sketch after this table) |
| num_videos_per_prompt | int | 1 | Number of videos to generate per prompt |
| generator | torch.Generator | manual_seed(42) | Random number generator for reproducibility |
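When use_dynamic_cfg is enabled, guidance_scale is not applied as a constant: the pipeline ramps the effective scale up over the denoising schedule, so early steps are guided weakly and late steps strongly. A simplified sketch of that cosine ramp follows; diffusers' actual CogVideoXPipeline implementation additionally shapes the progress term, so treat this as an approximation rather than the exact formula:
import math

def dynamic_guidance(guidance_scale: float, step: int, num_inference_steps: int) -> float:
    # Effective scale ramps from ~1.0 at the first step toward
    # 1.0 + guidance_scale at the last step via a cosine schedule.
    # (Approximation of the schedule used when use_dynamic_cfg=True.)
    progress = step / num_inference_steps
    return 1.0 + guidance_scale * (1.0 - math.cos(math.pi * progress)) / 2.0
At step 0 this returns 1.0 (plain conditional sampling); at the final step it reaches 1.0 + guidance_scale.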
Inputs
- Text prompt -- A string describing the desired video content
- Generation parameters -- Height, width, frame count, inference steps, guidance scale, seed (per-model values are sketched below)
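Because the valid height/width/num_frames combination differs per checkpoint (see the Key Parameters table), it can help to keep those values in a single lookup. A minimal sketch, assuming pipe is already loaded as in the usage example below; the MODEL_SPECS dict is a hypothetical helper, not part of the repo:
# Hypothetical lookup pairing each checkpoint with the resolution and
# frame-count values documented in the table above.
MODEL_SPECS = {
    "THUDM/CogVideoX-5b": {"height": 480, "width": 720, "num_frames": 49},
    "THUDM/CogVideoX1.5-5B": {"height": 768, "width": 1360, "num_frames": 81},
}

model_id = "THUDM/CogVideoX1.5-5B"
output = pipe(prompt="A calm ocean at sunset.", **MODEL_SPECS[model_id])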
Outputs
- CogVideoXPipelineOutput -- Output object containing the generated video frames
- Access frames via output.frames[0], which returns a List[PIL.Image.Image]; each element is a single video frame as a PIL Image
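Since each frame is an ordinary PIL image, standard PIL operations apply directly. A small sketch (the file name is illustrative):
frames = output.frames[0]
print(f"Generated {len(frames)} frames of size {frames[0].size}")
frames[0].save("first_frame.png")  # inspect a single frame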
Usage Example
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
# Load the pipeline and configure the scheduler and memory optimizations
pipe = CogVideoXPipeline.from_pretrained(
"THUDM/CogVideoX1.5-5B",
torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
pipe.scheduler.config,
timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
# Generate video
output = pipe(
prompt="A detailed wooden toy ship with intricately carved sails is seen "
"gliding smoothly over a calm, deep-blue ocean.",
height=768,
width=1360,
num_frames=81,
num_inference_steps=50,
guidance_scale=6.0,
use_dynamic_cfg=True,
num_videos_per_prompt=1,
generator=torch.Generator().manual_seed(42),
)
# Access the generated frames
frames = output.frames[0] # List[PIL.Image.Image]
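To turn the frame list into a playable file, diffusers provides the export_to_video utility, which is what inference/cli_demo.py does after this call. The fps value below is an assumption; match it to the model variant's native frame rate:
from diffusers.utils import export_to_video
# Write the PIL frames to an MP4 container (fps=16 is an assumed value
# for CogVideoX1.5; adjust per model variant).
export_to_video(frames, "output.mp4", fps=16)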
Import
from diffusers import CogVideoXPipeline
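CogVideoXPipeline first shipped in diffusers v0.30.0, and the CogVideoX1.5 checkpoints require a later release. A quick sanity check; the exact minimum shown is an assumption, so consult the repo's requirements for the authoritative pin:
import diffusers

# Assumed minimum: CogVideoX support landed in diffusers 0.30,
# CogVideoX1.5 support in a subsequent release.
assert tuple(map(int, diffusers.__version__.split(".")[:2])) >= (0, 30), diffusers.__version__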