Implementation:Zai org CogVideo LLM Flux CogVideoX Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Pipeline_Orchestration, Batch_Processing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
LLM Flux CogVideoX Pipeline is a command-line batch generation script that chains an LLM, a FLUX image generator, and a CogVideoX video generator to produce multiple videos from auto-generated captions without any external API dependencies.
Description
This script implements the same three-stage pipeline as the Gradio variant but is optimized for batch, non-interactive use via command-line arguments. It runs the three stages one after another, each over the whole batch, with careful GPU memory management:
- Caption Generation: A configurable LLM (default: `THUDM/glm-4-9b-chat`; other chat models such as Llama-3.1-8B-Instruct can be selected with `--caption_generator_model_id`) generates detailed video descriptions via a transformers text-generation pipeline. Each caption uses a randomly selected word limit (50, 75, or 100 words). All captions are generated first and saved to `captions.json`, then the LLM is unloaded.
- Image Generation: The FLUX.1-dev diffusion pipeline generates 480x720 images from each caption. Supports `torch.compile` for acceleration and configurable inference steps. Images are saved individually, then the image generator is unloaded.
- Video Generation: CogVideoX-5B-I2V generates 49-frame videos from each image-caption pair using a DPM scheduler with trailing timestep spacing. Supports dynamic CFG, configurable guidance scale, VAE tiling, and `torch.compile`.
Each model is loaded onto the GPU, used, and then explicitly deleted, followed by garbage collection and CUDA memory clearing, so that the three models never have to reside in GPU memory at the same time.
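The load, use, unload pattern described above can be sketched without any real models. In this minimal sketch, `run_stage` and the three `load_*_stub` functions are hypothetical names introduced for illustration; they are not taken from the script, and the real stages would also clear the CUDA cache after deleting each model.

```python
import gc
from typing import Callable, List


def run_stage(load_model: Callable[[], Callable], inputs: List) -> List:
    """Load one model, run it over every input, then release it."""
    model = load_model()
    try:
        return [model(item) for item in inputs]
    finally:
        # Drop the only reference so the model can be reclaimed
        # before the next stage loads its own model.
        del model
        gc.collect()


# Hypothetical stand-ins for the LLM, FLUX, and CogVideoX stages.
def load_caption_stub() -> Callable:
    return lambda index: f"caption {index}"


def load_image_stub() -> Callable:
    return lambda caption: f"image<{caption}>"


def load_video_stub() -> Callable:
    return lambda pair: f"video<{pair[0]} | {pair[1]}>"


# Each stage finishes for the whole batch before the next one starts.
captions = run_stage(load_caption_stub, list(range(2)))
images = run_stage(load_image_stub, captions)
videos = run_stage(load_video_stub, list(zip(images, captions)))
print(videos)
```

Running each stage over the whole batch means every model is loaded exactly once, at the cost of keeping intermediate captions and images around until the later stages consume them.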
Usage
Use this script for automated batch video generation when prompts are produced programmatically rather than typed by a user. It suits workflows that must generate many videos without human interaction, and GPU-memory-constrained environments where the models have to be loaded one at a time.
Code Reference
Source Location
- Repository: Zai_org_CogVideo
- File: `tools/llm_flux_cogvideox/llm_flux_cogvideox.py`
Entry Point
def main(args: Dict[str, Any]) -> None:
Import
# This is a standalone CLI script; run directly:
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py --num_videos 5
I/O Contract
Command-Line Arguments
| Name | Type | Default | Description |
|---|---|---|---|
| `--num_videos` | int | `5` | Number of unique videos to generate. |
| `--model_path` | str | `THUDM/CogVideoX-5B` | Path to CogVideoX-5B-I2V model. |
| `--caption_generator_model_id` | str | `THUDM/glm-4-9b-chat` | LLM model for caption generation. |
| `--caption_generator_cache_dir` | str | `None` | Cache directory for caption model. |
| `--image_generator_model_id` | str | `black-forest-labs/FLUX.1-dev` | Image generation model identifier. |
| `--image_generator_cache_dir` | str | `None` | Cache directory for image model. |
| `--image_generator_num_inference_steps` | int | `50` | Number of diffusion steps for image generation. |
| `--guidance_scale` | float | `7` | Guidance scale for video generation. |
| `--use_dynamic_cfg` | flag | `False` | Enable cosine dynamic guidance for video generation. |
| `--output_dir` | str | `outputs/` | Directory for generated images and videos. |
| `--compile` | flag | `False` | Compile transformer models with `torch.compile` for acceleration. |
| `--enable_vae_tiling` | flag | `False` | Enable VAE tiling for memory-efficient encoding/decoding. |
| `--seed` | int | `42` | Random seed for reproducibility. |
Outputs
| Name | Type | Description |
|---|---|---|
| `captions.json` | JSON file | All generated captions. |
| `{index}_{caption_prefix}.png` | PNG files | Generated images, one per caption. |
| `{index}_{caption_prefix}.mp4` | MP4 files | Generated videos at 8 FPS, one per image-caption pair. |
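As a rough illustration of the output layout, the sketch below writes `captions.json` and derives `{index}_{caption_prefix}` file stems. Several details are assumptions made for the example and not confirmed by this page: the JSON file is written as a plain list, indices start at 0, the prefix is the first 25 characters of the caption, and the set of characters replaced with underscores. The helper names `make_stem` and `save_captions` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def make_stem(index: int, caption: str, prefix_len: int = 25) -> str:
    """Build '{index}_{caption_prefix}' with filesystem-unfriendly characters replaced."""
    prefix = caption[:prefix_len]  # assumed prefix length
    for ch in " .,'\"/\\":
        prefix = prefix.replace(ch, "_")
    return f"{index}_{prefix}"


def save_captions(captions: List[str], output_dir: Path) -> Path:
    """Write all captions to captions.json inside output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "captions.json"
    path.write_text(json.dumps(captions, indent=2), encoding="utf-8")
    return path


captions = [
    "A red fox runs through snow.",
    "Waves crash on a rocky shore at dusk.",
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "outputs"
    captions_path = save_captions(captions, out)
    reloaded = json.loads(captions_path.read_text(encoding="utf-8"))

# The same stem is shared by the image and the video of each caption.
stems = [make_stem(i, c) for i, c in enumerate(captions)]
image_names = [f"{stem}.png" for stem in stems]
video_names = [f"{stem}.mp4" for stem in stems]
print(image_names, video_names)
```

Because the image and the video share a stem, each `.mp4` can be matched to its source `.png` by file name alone.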
Key Functions
get_args()
Parses command-line arguments using argparse.
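A minimal sketch of such a parser, reconstructed from the argument table on this page, is shown here. The help strings are paraphrased, the optional `argv` parameter is an addition that makes the sketch testable, and the `vars(args)` conversion is an assumption inferred from the documented `main(args: Dict[str, Any])` signature.

```python
import argparse
from typing import List, Optional


def get_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="LLM -> FLUX -> CogVideoX batch generation")
    parser.add_argument("--num_videos", type=int, default=5, help="Number of unique videos to generate.")
    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-5B", help="Path to the CogVideoX model.")
    parser.add_argument("--caption_generator_model_id", type=str, default="THUDM/glm-4-9b-chat")
    parser.add_argument("--caption_generator_cache_dir", type=str, default=None)
    parser.add_argument("--image_generator_model_id", type=str, default="black-forest-labs/FLUX.1-dev")
    parser.add_argument("--image_generator_cache_dir", type=str, default=None)
    parser.add_argument("--image_generator_num_inference_steps", type=int, default=50)
    parser.add_argument("--guidance_scale", type=float, default=7.0)
    # Flags default to False and become True when present on the command line.
    parser.add_argument("--use_dynamic_cfg", action="store_true")
    parser.add_argument("--output_dir", type=str, default="outputs/")
    parser.add_argument("--compile", action="store_true")
    parser.add_argument("--enable_vae_tiling", action="store_true")
    parser.add_argument("--seed", type=int, default=42)
    # argv=None makes argparse read sys.argv, as a CLI script normally would.
    return parser.parse_args(argv)


args = get_args(["--num_videos", "3", "--compile"])
config = vars(args)  # assumed conversion to the Dict[str, Any] that main() accepts
print(config["num_videos"], config["compile"], config["seed"])
```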
reset_memory()
Clears GPU memory between model stages via gc.collect(), torch.cuda.empty_cache(), and CUDA memory stats reset.
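A function of this shape might look like the sketch below. The exact set of stats-reset calls is an assumption (the page only says "CUDA memory stats reset"), and the import guard and `is_available()` check are additions so the sketch also runs on machines without PyTorch or a GPU.

```python
import gc


def reset_memory() -> None:
    """Release Python garbage, then free cached CUDA memory if a GPU is present."""
    gc.collect()
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: tolerate a missing PyTorch install.
        return
    if torch.cuda.is_available():
        # Return cached, unused blocks to the driver.
        torch.cuda.empty_cache()
        # Reset the counters so the next stage reports its own peak usage.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.reset_accumulated_memory_stats()


reset_memory()
```

Note that `torch.cuda.empty_cache()` can only release memory that is no longer referenced, so the model object must be deleted before this function is called.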
main(args)
Orchestrates the three-stage pipeline. The function is decorated with @torch.no_grad(), so no gradients are tracked during inference. It sequentially loads each model, generates outputs, saves results, and unloads the model before proceeding to the next stage.
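To make the final stage more concrete, here is a hedged sketch of how the video step could be configured with the diffusers library, following the settings listed on this page (DPM scheduler with trailing timestep spacing, 480x720, 49 frames, 8 FPS). The function names `build_video_pipeline` and `generate_video`, the `bfloat16` dtype, and the config keys are assumptions for illustration and are not copied from the script. Imports are kept inside the functions so the block can be loaded without a GPU stack installed.

```python
from typing import Any, Dict


def build_video_pipeline(config: Dict[str, Any]):
    """Load the image-to-video pipeline and apply the optional accelerations."""
    import torch
    from diffusers import CogVideoXDPMScheduler, CogVideoXImageToVideoPipeline

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        config["model_path"], torch_dtype=torch.bfloat16  # dtype is an assumption
    ).to("cuda")
    # Swap in the DPM scheduler with trailing timestep spacing.
    pipe.scheduler = CogVideoXDPMScheduler.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    if config.get("enable_vae_tiling"):
        pipe.vae.enable_tiling()
    if config.get("compile"):
        pipe.transformer = torch.compile(pipe.transformer)
    return pipe


def generate_video(pipe, image, caption: str, config: Dict[str, Any], output_path: str) -> str:
    """Generate one 49-frame clip from an image-caption pair and save it at 8 FPS."""
    import torch
    from diffusers.utils import export_to_video

    with torch.no_grad():
        frames = pipe(
            image=image,
            prompt=caption,
            height=480,
            width=720,
            num_frames=49,
            guidance_scale=config["guidance_scale"],
            use_dynamic_cfg=config["use_dynamic_cfg"],
            generator=torch.Generator().manual_seed(config["seed"]),
        ).frames[0]
    export_to_video(frames, output_path, fps=8)
    return output_path
```

Building the pipeline once and reusing it for every image-caption pair avoids reloading the weights per video; the pipeline would then be deleted and GPU memory cleared after the last clip.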
Usage Examples
# Generate 5 videos with default settings
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py --num_videos 5
# Generate with torch.compile, dynamic CFG, and VAE tiling
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py \
# --num_videos 3 \
# --compile \
# --use_dynamic_cfg \
# --enable_vae_tiling \
# --output_dir ./my_videos/
# Use custom models
# python tools/llm_flux_cogvideox/llm_flux_cogvideox.py \
# --caption_generator_model_id meta-llama/Llama-3.1-8B-Instruct \
# --model_path THUDM/CogVideoX-5B \
# --num_videos 10