Implementation:Zai org CogVideo LLM Flux CogVideoX Pipeline

Knowledge Sources	Zai_org_CogVideo
Domains	Video_Generation, Pipeline_Orchestration, Batch_Processing
Last Updated	2026-02-10 00:00 GMT

Overview

LLM Flux CogVideoX Pipeline is a command-line batch script that chains an LLM, the FLUX image generator, and the CogVideoX video generator to produce multiple videos from auto-generated captions. All three models run locally, so the script has no external API dependencies.

Description

This script implements the same three-stage pipeline as the Gradio variant but is optimized for batch, non-interactive use via command-line arguments. It runs the stages one after another, finishing each stage for every video before starting the next, with explicit GPU memory management:

Caption Generation: A configurable LLM (default: GLM-4-9B-Chat; Llama-3.1-8B-Instruct is also shown in the usage examples) generates detailed video descriptions via a transformers text-generation pipeline. Each caption uses a randomly selected word limit (50, 75, or 100 words). All captions are generated first and saved to captions.json, and then the LLM is unloaded.
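The caption stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's own code: `generate` is a hypothetical callable standing in for the transformers text-generation pipeline, the prompt wording is invented, and the layout of captions.json (a plain list of strings) is assumed.

```python
import json
import random
from pathlib import Path
from typing import Callable, List

WORD_LIMITS = (50, 75, 100)  # the three limits named in the description


def generate_captions(
    generate: Callable[[str], str],  # hypothetical wrapper around the LLM call
    num_videos: int,
    output_dir: str,
    seed: int = 42,
) -> List[str]:
    """Generate all captions first, then persist them to captions.json."""
    rng = random.Random(seed)
    captions = []
    for _ in range(num_videos):
        limit = rng.choice(WORD_LIMITS)  # random word limit per caption
        # Hypothetical prompt wording; the script's real prompt is not shown here.
        prompt = f"Write a detailed video description in at most {limit} words."
        captions.append(generate(prompt).strip())

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumed layout: a plain JSON list of caption strings.
    (out / "captions.json").write_text(json.dumps(captions, indent=2), encoding="utf-8")
    return captions
```

Saving the captions before unloading the LLM means the later stages can read them back from disk without the caption model in memory.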

Image Generation: The FLUX.1-dev diffusion pipeline generates a 480x720 image from each caption. This stage supports torch.compile for acceleration and a configurable number of inference steps. Each image is saved to its own file, and then the image generator is unloaded.
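A rough sketch of this stage using the diffusers `FluxPipeline` is shown below. The function name, its parameters, the bfloat16 dtype, and the bare `torch.compile` call are assumptions made for illustration; the script may configure these differently. It needs a CUDA GPU and the model weights to actually run.

```python
from typing import List, Optional, Sequence

IMAGE_HEIGHT, IMAGE_WIDTH = 480, 720  # resolution stated in the description


def generate_images(
    captions: Sequence[str],
    image_paths: Sequence[str],  # one output path per caption (hypothetical parameter)
    model_id: str = "black-forest-labs/FLUX.1-dev",
    cache_dir: Optional[str] = None,
    num_inference_steps: int = 50,
    compile_transformer: bool = False,
    seed: int = 42,
) -> List[str]:
    # Heavy imports stay inside the function so this module imports without a GPU stack.
    import torch
    from diffusers import FluxPipeline

    # dtype is an assumption; the script may load the model in another precision.
    pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16, cache_dir=cache_dir)
    pipe.to("cuda")
    if compile_transformer:
        # Compile options are an assumption; only the use of torch.compile is documented.
        pipe.transformer = torch.compile(pipe.transformer)

    generator = torch.Generator().manual_seed(seed)
    for caption, path in zip(captions, image_paths):
        image = pipe(
            prompt=caption,
            height=IMAGE_HEIGHT,
            width=IMAGE_WIDTH,
            num_inference_steps=num_inference_steps,
            generator=generator,
        ).images[0]
        image.save(path)  # images are saved individually

    del pipe  # drop the local reference to the pipeline
    return list(image_paths)
```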

Video Generation: CogVideoX-5B-I2V generates a 49-frame video from each image-caption pair using a DPM scheduler with trailing timestep spacing. This stage supports dynamic CFG, a configurable guidance scale, VAE tiling, and torch.compile.
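The video stage might look roughly like the sketch below, written against the diffusers `CogVideoXImageToVideoPipeline` and `CogVideoXDPMScheduler` classes. The function name, its parameters, and the bfloat16 dtype are assumptions; `model_path` is left without a default on purpose and must point at an image-to-video checkpoint. A small helper also shows the clip length implied by 49 frames at 8 FPS.

```python
from typing import List, Sequence

NUM_FRAMES = 49  # frame count stated in the description
FPS = 8          # frame rate listed for the MP4 outputs


def clip_seconds(num_frames: int = NUM_FRAMES, fps: int = FPS) -> float:
    """Playback duration of a generated clip in seconds."""
    return num_frames / fps


def generate_videos(
    image_paths: Sequence[str],
    captions: Sequence[str],
    video_paths: Sequence[str],  # one output path per pair (hypothetical parameter)
    model_path: str,             # must be an image-to-video checkpoint
    guidance_scale: float = 7.0,
    use_dynamic_cfg: bool = False,
    enable_vae_tiling: bool = False,
    seed: int = 42,
) -> List[str]:
    # Heavy imports stay inside the function so this module imports without a GPU stack.
    import torch
    from diffusers import CogVideoXDPMScheduler, CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    # dtype is an assumption; the script may load the model in another precision.
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    # Swap in the DPM scheduler with trailing timestep spacing.
    pipe.scheduler = CogVideoXDPMScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    pipe.to("cuda")
    if enable_vae_tiling:
        pipe.vae.enable_tiling()

    generator = torch.Generator().manual_seed(seed)
    for image_path, caption, video_path in zip(image_paths, captions, video_paths):
        frames = pipe(
            image=load_image(image_path),
            prompt=caption,
            num_frames=NUM_FRAMES,
            guidance_scale=guidance_scale,
            use_dynamic_cfg=use_dynamic_cfg,
            generator=generator,
        ).frames[0]
        export_to_video(frames, video_path, fps=FPS)
    return list(video_paths)
```

With these values, `clip_seconds()` evaluates to 6.125, so each clip plays for a little over six seconds.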

Each model is loaded onto the GPU, used, and then explicitly deleted. Garbage collection and CUDA memory clearing run between stages, so only one model occupies GPU memory at a time and the three models never need to fit on the device together.

Usage

Use this script for automated batch video generation from programmatic prompts. It is designed for production workflows where multiple videos need to be generated without human interaction, and where resource-constrained environments require sequential model loading.

Code Reference

Source Location

Repository: Zai_org_CogVideo
File: tools/llm_flux_cogvideox/llm_flux_cogvideox.py

Entry Point

def main(args: Dict[str, Any]) -> None:

Import

# This is a standalone CLI script; run directly:
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py --num_videos 5

I/O Contract

Command-Line Arguments

Name	Type	Default	Description
`--num_videos`	`int`	`5`	Number of unique videos to generate.
`--model_path`	`str`	`THUDM/CogVideoX-5B`	Path to CogVideoX-5B-I2V model.
`--caption_generator_model_id`	`str`	`THUDM/glm-4-9b-chat`	LLM model for caption generation.
`--caption_generator_cache_dir`	`str`	`None`	Cache directory for caption model.
`--image_generator_model_id`	`str`	`black-forest-labs/FLUX.1-dev`	Image generation model identifier.
`--image_generator_cache_dir`	`str`	`None`	Cache directory for image model.
`--image_generator_num_inference_steps`	`int`	`50`	Number of diffusion steps for image generation.
`--guidance_scale`	`float`	`7`	Guidance scale for video generation.
`--use_dynamic_cfg`	flag	`False`	Enable cosine dynamic guidance for video generation.
`--output_dir`	`str`	`outputs/`	Directory for generated images and videos.
`--compile`	flag	`False`	Compile transformer models with `torch.compile` for acceleration.
`--enable_vae_tiling`	flag	`False`	Enable VAE tiling for memory-efficient encoding/decoding.
`--seed`	`int`	`42`	Random seed for reproducibility.

Outputs

Name	Type	Description
`captions.json`	JSON file	All generated captions.
`{index}_{caption_prefix}.png`	PNG files	Generated images, one per caption.
`{index}_{caption_prefix}.mp4`	MP4 files	Generated videos at 8 FPS, one per image-caption pair.
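The `{index}_{caption_prefix}` naming pattern could be produced by a helper like the one below. The helper name, the 25-character prefix length, and the sanitizing rule are all hypothetical; the table only fixes the overall pattern.

```python
import re


def output_stem(index: int, caption: str, prefix_len: int = 25) -> str:
    """Build a filesystem-friendly stem of the form {index}_{caption_prefix}."""
    # Keep the first prefix_len characters and collapse anything that is not
    # a letter or digit into a single underscore (assumed sanitizing rule).
    prefix = re.sub(r"[^A-Za-z0-9]+", "_", caption[:prefix_len]).strip("_")
    return f"{index}_{prefix}"


# The image and the video for one caption share the same stem.
stem = output_stem(0, "A red fox runs.")
image_name = f"{stem}.png"
video_name = f"{stem}.mp4"
```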

Key Functions

`get_args()`

Parses command-line arguments using argparse.
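A parser consistent with the argument table above could be sketched like this. The help strings are paraphrased from the table, and returning a dict (to match the `Dict[str, Any]` annotation on `main`) plus the optional `argv` parameter are assumptions added for testability.

```python
import argparse
from typing import Any, Dict, List, Optional


def get_args(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    parser = argparse.ArgumentParser(description="LLM -> FLUX -> CogVideoX batch pipeline")
    parser.add_argument("--num_videos", type=int, default=5, help="Number of unique videos to generate.")
    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-5B", help="Path to the CogVideoX model.")
    parser.add_argument("--caption_generator_model_id", type=str, default="THUDM/glm-4-9b-chat")
    parser.add_argument("--caption_generator_cache_dir", type=str, default=None)
    parser.add_argument("--image_generator_model_id", type=str, default="black-forest-labs/FLUX.1-dev")
    parser.add_argument("--image_generator_cache_dir", type=str, default=None)
    parser.add_argument("--image_generator_num_inference_steps", type=int, default=50)
    parser.add_argument("--guidance_scale", type=float, default=7.0)
    # Boolean flags default to False and become True when present.
    parser.add_argument("--use_dynamic_cfg", action="store_true")
    parser.add_argument("--output_dir", type=str, default="outputs/")
    parser.add_argument("--compile", action="store_true")
    parser.add_argument("--enable_vae_tiling", action="store_true")
    parser.add_argument("--seed", type=int, default=42)
    # argv=None makes argparse read sys.argv; vars() turns the namespace into a dict.
    return vars(parser.parse_args(argv))
```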

`reset_memory()`

Clears GPU memory between model stages via gc.collect(), torch.cuda.empty_cache(), and CUDA memory stats reset.
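A minimal version of such a helper, assuming the CUDA statistics reset refers to PyTorch's peak and accumulated memory counters, is sketched below. The guarded import and the `is_available()` check are additions so the sketch also runs on machines without PyTorch or a GPU.

```python
import gc


def reset_memory() -> None:
    """Free Python garbage, then release cached CUDA memory if a GPU is present."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return  # nothing more to do without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()                      # return cached blocks to the driver
        torch.cuda.reset_peak_memory_stats()          # reset peak-usage counters
        torch.cuda.reset_accumulated_memory_stats()   # reset accumulated counters
```

Calling it only helps after the last Python reference to a model has been dropped, for example with `del pipe`, since the allocator cannot release memory that is still referenced.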

`main(args)`

Orchestrates the three-stage pipeline and is decorated with @torch.no_grad(), so no gradients are tracked during inference. It loads each model in turn, generates and saves that stage's outputs, and unloads the model before proceeding to the next stage.
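The load, use, unload pattern can be illustrated with a small framework-free sketch. `run_stage` and its parameters are hypothetical names; in the real script the loaders would build the LLM, FLUX, and CogVideoX pipelines, and the cleanup step would be `reset_memory()`.

```python
import gc
from typing import Any, Callable, Iterable, List


def run_stage(
    load: Callable[[], Any],            # builds the model for this stage
    run: Callable[[Any, Any], Any],     # applies the model to one input
    inputs: Iterable[Any],
    cleanup: Callable[[], Any] = gc.collect,
) -> List[Any]:
    """Load one model, process every input, then release it before returning."""
    model = load()
    try:
        outputs = [run(model, item) for item in inputs]
    finally:
        del model   # drop the only reference held by this function
        cleanup()   # reclaim memory before the next stage loads its model
    return outputs


# Stand-in stages: strings take the place of real models and artifacts.
captions = run_stage(lambda: "llm", lambda m, i: f"{m}-caption-{i}", range(2))
images = run_stage(lambda: "flux", lambda m, c: f"{m}:{c}", captions)
videos = run_stage(lambda: "cogvideox", lambda m, img: f"{m}:{img}", images)
```

Because each stage finishes for every item before the next model is loaded, peak memory is bounded by the largest single model instead of the sum of all three.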

Usage Examples

# Generate 5 videos with default settings
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py --num_videos 5

# Generate with torch.compile, dynamic CFG, and VAE tiling
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py \
#     --num_videos 3 \
#     --compile \
#     --use_dynamic_cfg \
#     --enable_vae_tiling \
#     --output_dir ./my_videos/

# Use a custom caption model and pass the video model path explicitly
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py \
#     --caption_generator_model_id meta-llama/Llama-3.1-8B-Instruct \
#     --model_path THUDM/CogVideoX-5B \
#     --num_videos 10

Related Pages

Principle:Zai_org_CogVideo_LLM_Image_Video_Pipeline
