Implementation:Zai org CogVideo LLM Flux CogVideoX Pipeline

From Leeroopedia


Knowledge Sources
Domains: Video_Generation, Pipeline_Orchestration, Batch_Processing
Last Updated: 2026-02-10 00:00 GMT

Overview

LLM Flux CogVideoX Pipeline is a command-line batch generation script that chains an LLM, the FLUX image generator, and the CogVideoX video generator to produce multiple videos from auto-generated captions. All three models run locally, so the script has no external API dependencies.

Description

This script implements the same three-stage pipeline as the Gradio variant, but it is optimized for batch, non-interactive use through command-line arguments. It generates multiple videos in sequence and manages GPU memory explicitly between stages:

  1. Caption Generation: A configurable LLM (default: GLM-4-9B-Chat; Llama-3.1-8B-Instruct can be substituted) generates detailed video descriptions via a transformers text-generation pipeline. Each caption uses a randomly selected word limit (50, 75, or 100 words). All captions are generated first and saved to captions.json, then the LLM is unloaded.
  2. Image Generation: The FLUX.1-dev diffusion pipeline generates a 480x720 image from each caption. It supports torch.compile for acceleration and a configurable number of inference steps. Images are saved individually, then the image generator is unloaded.
  3. Video Generation: CogVideoX-5B-I2V generates a 49-frame video from each image-caption pair using a DPM scheduler with trailing timestep spacing. It supports dynamic CFG, a configurable guidance scale, VAE tiling, and torch.compile.
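The caption stage in step 1 can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `generate_caption` is a hypothetical stand-in for the LLM call, the prompt wording is an assumption, and storing captions.json as a plain JSON list is also an assumption. Only the word limits (50, 75, 100) and the captions.json filename come from the description above.

```python
import json
import random
from pathlib import Path
from typing import Callable, List

WORD_LIMITS = (50, 75, 100)  # word limits named in the description


def generate_caption(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call; returns a canned string."""
    return f"[caption for: {prompt}]"


def build_captions(
    num_videos: int,
    output_dir: Path,
    seed: int = 42,
    caption_fn: Callable[[str], str] = generate_caption,
) -> List[str]:
    """Generate every caption up front, then persist them in one file."""
    rng = random.Random(seed)  # local RNG so runs are repeatable
    captions = []
    for _ in range(num_videos):
        limit = rng.choice(WORD_LIMITS)  # one random word limit per caption
        # Assumed prompt wording, for illustration only.
        prompt = f"Describe a video scene in at most {limit} words."
        captions.append(caption_fn(prompt))
    output_dir.mkdir(parents=True, exist_ok=True)
    # Assumed layout: a JSON list of caption strings.
    (output_dir / "captions.json").write_text(json.dumps(captions, indent=2))
    return captions
```

Generating all captions before any image work means the LLM can be released once, rather than being reloaded for every video.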

Each model is loaded onto the GPU, used, and then explicitly deleted, followed by garbage collection and CUDA memory clearing. Because only one model is resident at a time, the three-model pipeline can run on a GPU that could not hold all three models simultaneously.
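The load, use, and unload pattern can be illustrated with placeholder models. The sketch below is an assumption-laden outline rather than the script's code: `StubModel`, `run_stage`, and `run_pipeline` are hypothetical names, and `gc.collect` stands in for the fuller cleanup the real script performs between stages.

```python
import gc
from typing import Callable, List


class StubModel:
    """Hypothetical placeholder for a real LLM or diffusion pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __call__(self, item: str) -> str:
        return f"{self.name}({item})"


def run_stage(
    load: Callable[[], StubModel],
    work: Callable[[StubModel], List[str]],
    cleanup: Callable[[], object] = gc.collect,
) -> List[str]:
    """Load one model, run it, then release it before the next stage."""
    model = load()
    try:
        return work(model)
    finally:
        del model   # drop this function's reference to the model
        cleanup()   # in the real script this would also clear CUDA caches


def run_pipeline(seeds: List[str]) -> List[str]:
    # Stage 1: captions from the LLM stand-in.
    captions = run_stage(lambda: StubModel("llm"),
                         lambda m: [m(s) for s in seeds])
    # Stage 2: one image per caption.
    images = run_stage(lambda: StubModel("flux"),
                       lambda m: [m(c) for c in captions])
    # Stage 3: one video per image-caption pair.
    return run_stage(
        lambda: StubModel("cogvideox"),
        lambda m: [m(f"{i}+{c}") for i, c in zip(images, captions)],
    )
```

Each stage keeps only its outputs (strings here, files on disk in the real script), so nothing holds a reference to the previous model when the next one loads.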

Usage

Use this script for automated batch video generation from programmatic prompts. It is designed for production workflows where multiple videos need to be generated without human interaction, and where resource-constrained environments require sequential model loading.

Code Reference

Source Location

  • Repository: Zai_org_CogVideo
  • File: tools/llm_flux_cogvideox/llm_flux_cogvideox.py

Entry Point

def main(args: Dict[str, Any]) -> None:

Import

# This is a standalone CLI script; run directly:
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py --num_videos 5

I/O Contract

Command-Line Arguments

Name | Type | Default | Description
--num_videos | int | 5 | Number of unique videos to generate.
--model_path | str | THUDM/CogVideoX-5B | Path to CogVideoX-5B-I2V model.
--caption_generator_model_id | str | THUDM/glm-4-9b-chat | LLM model for caption generation.
--caption_generator_cache_dir | str | None | Cache directory for caption model.
--image_generator_model_id | str | black-forest-labs/FLUX.1-dev | Image generation model identifier.
--image_generator_cache_dir | str | None | Cache directory for image model.
--image_generator_num_inference_steps | int | 50 | Number of diffusion steps for image generation.
--guidance_scale | float | 7 | Guidance scale for video generation.
--use_dynamic_cfg | flag | False | Enable cosine dynamic guidance for video generation.
--output_dir | str | outputs/ | Directory for generated images and videos.
--compile | flag | False | Compile transformer models with torch.compile for acceleration.
--enable_vae_tiling | flag | False | Enable VAE tiling for memory-efficient encoding/decoding.
--seed | int | 42 | Random seed for reproducibility.
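To give a feel for what "cosine dynamic guidance" means for --use_dynamic_cfg, the sketch below shows one plausible cosine ramp. The real schedule is applied inside the video pipeline and is not reproduced on this page, so the function name `dynamic_guidance`, the normalized `progress` argument, and the exponent are all assumptions made for illustration.

```python
import math


def dynamic_guidance(guidance_scale: float, progress: float,
                     power: float = 5.0) -> float:
    """Cosine ramp from 1.0 (progress=0) up to 1 + guidance_scale (progress=1).

    `progress` is the fraction of denoising already completed, in [0, 1].
    `power` (an assumed value) keeps guidance low for most of the run and
    raises it sharply near the end.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    ramp = (1.0 - math.cos(math.pi * progress ** power)) / 2.0
    return 1.0 + guidance_scale * ramp
```

With a fixed scale every denoising step uses the same guidance; with a ramp like this one, early steps are only lightly guided and later steps are guided more strongly.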

Outputs

Name | Type | Description
captions.json | JSON file | All generated captions.
{index}_{caption_prefix}.png | PNG files | Generated images, one per caption.
{index}_{caption_prefix}.mp4 | MP4 files | Generated videos at 8 FPS, one per image-caption pair.
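The {index}_{caption_prefix} naming scheme could be produced along the following lines. The helper name `output_paths`, the 25-character prefix length, and the rule of replacing non-alphanumeric characters with underscores are assumptions for illustration; the page does not state how the script derives the prefix.

```python
from pathlib import Path
from typing import Tuple


def output_paths(index: int, caption: str, output_dir: str = "outputs/",
                 prefix_len: int = 25) -> Tuple[Path, Path]:
    """Return the (image, video) paths for one caption."""
    # Assumed rule: keep letters and digits, replace everything else with "_"
    # so the caption prefix is safe to use in a filename.
    prefix = "".join(ch if ch.isalnum() else "_" for ch in caption[:prefix_len])
    stem = f"{index}_{prefix}"
    base = Path(output_dir)
    return base / f"{stem}.png", base / f"{stem}.mp4"
```

Sharing one stem between the PNG and the MP4 makes it easy to match each video to the image it was generated from.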

Key Functions

get_args()

Parses command-line arguments using argparse.
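A reduced sketch of such a parser is shown below, covering a subset of the arguments and defaults listed in the Command-Line Arguments table. The optional `argv` parameter is an addition made here so the function can be exercised without touching `sys.argv`; the help strings are paraphrased and the real function may differ.

```python
import argparse
from typing import List, Optional


def get_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a subset of the script's documented options."""
    parser = argparse.ArgumentParser(
        description="Batch LLM -> FLUX -> CogVideoX generation (sketch)."
    )
    parser.add_argument("--num_videos", type=int, default=5,
                        help="Number of unique videos to generate.")
    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-5B",
                        help="Path or hub ID of the video model.")
    parser.add_argument("--guidance_scale", type=float, default=7.0,
                        help="Guidance scale for video generation.")
    parser.add_argument("--use_dynamic_cfg", action="store_true",
                        help="Enable cosine dynamic guidance.")
    parser.add_argument("--output_dir", type=str, default="outputs/",
                        help="Directory for generated images and videos.")
    parser.add_argument("--compile", action="store_true",
                        help="Compile transformer models with torch.compile.")
    parser.add_argument("--enable_vae_tiling", action="store_true",
                        help="Enable VAE tiling to reduce memory use.")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility.")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Options listed as "flag" in the table map naturally to `action="store_true"`, which is why their default is False.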

reset_memory()

Clears GPU memory between model stages via gc.collect(), torch.cuda.empty_cache(), and CUDA memory stats reset.
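A minimal sketch of such a helper follows. It is written to run even without PyTorch or a GPU, which the real script does not need to handle; the optional import, the availability check, and the integer return value are additions made here for illustration.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def reset_memory() -> int:
    """Free unreferenced Python objects, then release cached CUDA memory."""
    # Collect first so tensors owned by deleted models become freeable.
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()                     # return cached blocks to the driver
        torch.cuda.reset_peak_memory_stats()         # restart peak-usage tracking
        torch.cuda.reset_accumulated_memory_stats()  # restart cumulative counters
    return collected
```

Calling `gc.collect()` before `torch.cuda.empty_cache()` matters: the cache can only release memory whose owning tensors have already been destroyed.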

main(args)

Orchestrates the three-stage pipeline. The function is decorated with @torch.no_grad(), so no gradients are tracked during inference. It loads each model in turn, generates outputs, saves the results, and unloads the model before proceeding to the next stage.

Usage Examples

# Generate 5 videos with default settings
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py --num_videos 5

# Generate with torch.compile and dynamic CFG
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py \
#     --num_videos 3 \
#     --compile \
#     --use_dynamic_cfg \
#     --enable_vae_tiling \
#     --output_dir ./my_videos/

# Use custom models
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py \
#     --caption_generator_model_id meta-llama/Llama-3.1-8B-Instruct \
#     --model_path THUDM/CogVideoX-5B \
#     --num_videos 10
