
Implementation:Zai org CogVideo CogVideoXPipeline Call

From Leeroopedia



Overview

Concrete tool for text-to-video generation using the CogVideoXPipeline callable provided by the diffusers library. This is the core generation step, which transforms a text prompt into video frames through iterative denoising.

Source

inference/cli_demo.py:L170-181

Signature

output = pipe(
    prompt: str,
    height: int,           # 480 for 5B, 768 for 1.5-5B
    width: int,            # 720 for 5B, 1360 for 1.5-5B
    num_frames: int = 81,
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    use_dynamic_cfg: bool = True,
    num_videos_per_prompt: int = 1,
    generator: torch.Generator = torch.Generator().manual_seed(42),
) -> CogVideoXPipelineOutput
# Access frames: output.frames[0] -> List[PIL.Image]
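The per-model resolution and frame-count pairings noted in the signature comments can be collected into a small lookup. This is a convenience sketch, not part of the diffusers API; the helper name is hypothetical and the values come from this page:

```python
# Hypothetical helper: default generation settings per CogVideoX checkpoint,
# using the values documented on this page (not part of diffusers itself).
MODEL_DEFAULTS = {
    "THUDM/CogVideoX-5b":    {"height": 480, "width": 720,  "num_frames": 49},
    "THUDM/CogVideoX1.5-5B": {"height": 768, "width": 1360, "num_frames": 81},
}

def generation_kwargs(model_id: str, **overrides) -> dict:
    """Return pipe(...) keyword arguments for a known checkpoint."""
    kwargs = dict(MODEL_DEFAULTS[model_id])
    kwargs.update(overrides)  # caller-supplied values win over defaults
    return kwargs
```

For example, `generation_kwargs("THUDM/CogVideoX1.5-5B", num_inference_steps=50)` yields a dict that can be splatted into the `pipe(...)` call alongside the prompt.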

Key Parameters

Parameter | Type | Default | Description
prompt | str | (required) | Text description of the desired video content
height | int | (required) | Output video height in pixels: 480 for CogVideoX-5b, 768 for CogVideoX1.5-5B
width | int | (required) | Output video width in pixels: 720 for CogVideoX-5b, 1360 for CogVideoX1.5-5B
num_frames | int | 81 | Number of frames to generate: 49 for CogVideoX-1.0, 81 for CogVideoX-1.5
num_inference_steps | int | 50 | Number of denoising steps; higher values improve quality but take longer
guidance_scale | float | 6.0 | Classifier-free guidance scale; higher values enforce stronger prompt adherence
use_dynamic_cfg | bool | True | Whether to vary the guidance scale dynamically during sampling
num_videos_per_prompt | int | 1 | Number of videos to generate per prompt
generator | torch.Generator | manual_seed(42) | Random number generator for reproducibility
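When use_dynamic_cfg is True, the pipeline does not apply guidance_scale as a constant: in the diffusers implementation the effective scale follows a cosine-style ramp from about 1.0 at the first denoising step up to 1 + guidance_scale at the last. A standalone sketch of that schedule (the exact formula may differ between diffusers versions):

```python
import math

def dynamic_guidance(guidance_scale: float, step: int, num_inference_steps: int) -> float:
    # Cosine-style ramp used for dynamic CFG: starts near 1.0 (almost
    # unconditional) and rises to 1 + guidance_scale by the final step.
    progress = step / num_inference_steps  # fraction of sampling completed
    return 1.0 + guidance_scale * (1.0 - math.cos(math.pi * progress ** 5.0)) / 2.0
```

With guidance_scale=6.0 this gives 1.0 at step 0 and 7.0 at the final step, keeping early steps loose and tightening prompt adherence late in sampling.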

Inputs

  • Text prompt -- A string describing the desired video content
  • Generation parameters -- Height, width, frame count, inference steps, guidance scale, seed

Outputs

  • CogVideoXPipelineOutput -- Output object containing generated video frames
    • Access frames via output.frames[0] which returns a List[PIL.Image.Image]
    • Each element is a single video frame as a PIL Image
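Since output.frames[0] is a list of PIL images, downstream processing typically begins by stacking it into an array. A minimal sketch using dummy stand-in frames (the sizes here are arbitrary, not real pipeline output):

```python
import numpy as np
from PIL import Image

# Dummy stand-ins for output.frames[0]; real frames would be e.g. 768x1360.
frames = [Image.new("RGB", (64, 48), color=(i, 0, 0)) for i in range(4)]

# Stack into a (num_frames, height, width, channels) uint8 array.
video = np.stack([np.asarray(frame) for frame in frames])
print(video.shape)  # (4, 48, 64, 3)
```

To write the frames straight to disk instead, diffusers ships a helper: diffusers.utils.export_to_video(frames, "output.mp4", fps=16).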

Usage Example

import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler

# Assume pipeline is loaded and configured
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Generate video
output = pipe(
    prompt="A detailed wooden toy ship with intricately carved sails is seen "
           "gliding smoothly over a calm, blue ocean.",
    height=768,
    width=1360,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    num_videos_per_prompt=1,
    generator=torch.Generator().manual_seed(42),
)

# Access the generated frames
frames = output.frames[0]  # List[PIL.Image.Image]

Import

from diffusers import CogVideoXPipeline
