Implementation: zai-org/CogVideo CogVideoXPipeline Call
Overview
Concrete tool for text-to-video generation using the CogVideoXPipeline callable provided by the diffusers library. This is the core generation step: it transforms a text prompt into video frames through iterative denoising.
Source
inference/cli_demo.py:L170-181
Signature
output = pipe(
prompt: str,
height: int, # 480 for 5B, 768 for 1.5-5B
width: int, # 720 for 5B, 1360 for 1.5-5B
num_frames: int = 81,
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
use_dynamic_cfg: bool = True,
num_videos_per_prompt: int = 1,
generator: torch.Generator = torch.Generator().manual_seed(42),
) -> CogVideoXPipelineOutput
# Access frames: output.frames[0] -> List[PIL.Image.Image]
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | (required) | Text description of the desired video content |
| height | int | (required) | Output video height in pixels. 480 for CogVideoX-5b, 768 for CogVideoX1.5-5B |
| width | int | (required) | Output video width in pixels. 720 for CogVideoX-5b, 1360 for CogVideoX1.5-5B |
| num_frames | int | 81 | Number of frames to generate. 49 for CogVideoX-1.0, 81 for CogVideoX-1.5 |
| num_inference_steps | int | 50 | Number of denoising steps. Higher values produce better quality but take longer |
| guidance_scale | float | 6.0 | Classifier-free guidance scale. Higher values produce stronger prompt adherence |
| use_dynamic_cfg | bool | True | Whether to vary the guidance scale dynamically during sampling (see the sketch after this table) |
| num_videos_per_prompt | int | 1 | Number of videos to generate per prompt |
| generator | torch.Generator | manual_seed(42) | Random number generator for reproducibility |
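When use_dynamic_cfg is enabled, guidance_scale is not applied as a constant: the pipeline ramps the effective scale up over the denoising schedule, so early steps are guided weakly and late steps strongly. A simplified sketch of that cosine ramp follows; diffusers' actual CogVideoXPipeline implementation additionally shapes the progress term, so treat this as an approximation rather than the exact formula:
import math

def dynamic_guidance(guidance_scale: float, step: int, num_inference_steps: int) -> float:
    # Effective scale ramps from ~1.0 at the first step toward
    # 1.0 + guidance_scale at the last step via a cosine schedule.
    # (Approximation of the schedule used when use_dynamic_cfg=True.)
    progress = step / num_inference_steps
    return 1.0 + guidance_scale * (1.0 - math.cos(math.pi * progress)) / 2.0
At step 0 this returns 1.0 (plain conditional sampling); at the final step it reaches 1.0 + guidance_scale.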
Inputs
- Text prompt -- A string describing the desired video content
- Generation parameters -- Height, width, frame count, inference steps, guidance scale, seed (per-model values are sketched below)
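Because the valid height/width/num_frames combination differs per checkpoint (see the Key Parameters table), it can help to keep those values in a single lookup. A minimal sketch, assuming pipe is already loaded as in the usage example below; the MODEL_SPECS dict is a hypothetical helper, not part of the repo:
# Hypothetical lookup pairing each checkpoint with the resolution and
# frame-count values documented in the table above.
MODEL_SPECS = {
    "THUDM/CogVideoX-5b": {"height": 480, "width": 720, "num_frames": 49},
    "THUDM/CogVideoX1.5-5B": {"height": 768, "width": 1360, "num_frames": 81},
}

model_id = "THUDM/CogVideoX1.5-5B"
output = pipe(prompt="A calm ocean at sunset.", **MODEL_SPECS[model_id])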
Outputs
- CogVideoXPipelineOutput -- Output object containing the generated video frames
- Access frames via output.frames[0], which returns a List[PIL.Image.Image]; each element is a single video frame as a PIL Image
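Since each frame is an ordinary PIL image, standard PIL operations apply directly. A small sketch (the file name is illustrative):
frames = output.frames[0]
print(f"Generated {len(frames)} frames of size {frames[0].size}")
frames[0].save("first_frame.png")  # inspect a single frame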
Usage Example
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
# Load the pipeline and configure the scheduler and memory optimizations
pipe = CogVideoXPipeline.from_pretrained(
"THUDM/CogVideoX1.5-5B",
torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
pipe.scheduler.config,
timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
# Generate video
output = pipe(
prompt="A detailed wooden toy ship with intricately carved sails is seen "
"gliding smoothly over a calm, deep-blue ocean.",
height=768,
width=1360,
num_frames=81,
num_inference_steps=50,
guidance_scale=6.0,
use_dynamic_cfg=True,
num_videos_per_prompt=1,
generator=torch.Generator().manual_seed(42),
)
# Access the generated frames
frames = output.frames[0] # List[PIL.Image.Image]
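To turn the frame list into a playable file, diffusers provides the export_to_video utility, which is what inference/cli_demo.py does after this call. The fps value below is an assumption; match it to the model variant's native frame rate:
from diffusers.utils import export_to_video
# Write the PIL frames to an MP4 container (fps=16 is an assumed value
# for CogVideoX1.5; adjust per model variant).
export_to_video(frames, "output.mp4", fps=16)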
Import
from diffusers import CogVideoXPipeline
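CogVideoXPipeline first shipped in diffusers v0.30.0, and the CogVideoX1.5 checkpoints require a later release. A quick sanity check; the exact minimum shown is an assumption, so consult the repo's requirements for the authoritative pin:
import diffusers

# Assumed minimum: CogVideoX support landed in diffusers 0.30,
# CogVideoX1.5 support in a subsequent release.
assert tuple(map(int, diffusers.__version__.split(".")[:2])) >= (0, 30), diffusers.__version__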