Implementation:Zai org CogVideo LLM Flux Gradio
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Web_Interface, Pipeline_Orchestration |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
LLM Flux Gradio is a Gradio-based web application that provides an interactive three-step pipeline for generating videos from text prompts by chaining an LLM caption generator, FLUX image generator, and CogVideoX video generator.
Description
This module implements a web-based user interface using Gradio that orchestrates three generative models in sequence:
- Caption Generation: A Llama-3.1-8B-Instruct language model refines user prompts into detailed video descriptions. The LLM is guided by a system prompt that instructs it to produce single, descriptive video generation prompts with a randomly selected word limit (25, 50, 75, or 100 words).
- Image Generation: The FLUX.1-dev diffusion model generates a 480x720 image from the refined caption with 30 inference steps and a guidance scale of 3.5.
- Video Generation: The CogVideoX-5B-I2V (image-to-video) pipeline generates a 49-frame video at 480x720 resolution from the image and caption, using 50 inference steps, guidance scale 6, and dynamic CFG.
All three models are loaded at startup in torch.bfloat16 precision with device_map="balanced", which distributes them across the available GPUs. The CogVideoX VAE uses both slicing and tiling to reduce memory usage. Each generated video is also converted to GIF format via moviepy, and a background daemon thread deletes output files older than 10 minutes.
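The random word limit in the caption step can be sketched as follows. This is a minimal illustration rather than the script's actual code: `WORD_LIMITS` mirrors the four limits listed above, while `USER_TEMPLATE`, `build_user_prompt`, and the request wording are hypothetical.

```python
import random
from typing import Optional, Tuple

# Word limits named in the description above.
WORD_LIMITS = (25, 50, 75, 100)

# Hypothetical request template; the wording in the real script may differ.
USER_TEMPLATE = "Write a video generation prompt for: {prompt}. Limit it to {limit} words."


def build_user_prompt(prompt: str, rng: Optional[random.Random] = None) -> Tuple[str, int]:
    """Pick one word limit at random and embed it in the request sent to the LLM."""
    chooser = rng if rng is not None else random
    limit = chooser.choice(WORD_LIMITS)
    return USER_TEMPLATE.format(prompt=prompt, limit=limit), limit


if __name__ == "__main__":
    text, limit = build_user_prompt("a cat surfing at sunset")
    print(limit, text)
```

Passing a seeded `random.Random` instance makes the chosen limit repeatable, which is convenient when comparing captions across runs.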
Usage
Use this Gradio application for interactive experimentation with the full text-to-video pipeline. It is designed for prototyping and demonstration purposes, providing a step-by-step visual workflow where users can review and adjust the caption and image before generating the final video.
Code Reference
Source Location
- Repository: Zai_org_CogVideo
- File: tools/llm_flux_cogvideox/gradio_page.py
Entry Point
if __name__ == "__main__":
    demo.launch()
Import
# This is a standalone script; run directly:
# python tools/llm_flux_cogvideox/gradio_page.py
I/O Contract
Pipeline Functions
generate_caption(prompt)
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | User's text prompt describing the desired video. |
Returns a str containing the LLM-refined detailed caption.
generate_image(caption)
| Name | Type | Required | Description |
|---|---|---|---|
| caption | str | Yes | Refined caption from the LLM. |
Returns a tuple of (PIL.Image, PIL.Image): one for display, one stored in Gradio state.
generate_video(caption, image)
| Name | Type | Required | Description |
|---|---|---|---|
| caption | str | Yes | Caption used for video generation conditioning. |
| image | PIL.Image | Yes | Source image for image-to-video generation. |
Returns a tuple of (str, str): the video file path and GIF file path.
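The way the three return values feed into one another can be sketched with stand-in functions. Everything here is hypothetical: `run_pipeline` and the `fake_*` stubs exist only to show the data flow of the contracts above, and in the web UI each step is triggered separately so the user can review the intermediate result.

```python
from typing import Any, Callable, Tuple


def run_pipeline(
    prompt: str,
    caption_fn: Callable[[str], str],
    image_fn: Callable[[str], Tuple[Any, Any]],
    video_fn: Callable[[str, Any], Tuple[str, str]],
) -> Tuple[str, str, str]:
    """Chain the three steps and return (caption, video_path, gif_path)."""
    caption = caption_fn(prompt)
    # The image step returns two images; the second is the copy kept in state.
    _display_image, state_image = image_fn(caption)
    video_path, gif_path = video_fn(caption, state_image)
    return caption, video_path, gif_path


# Stand-in stubs so the sketch runs without any model weights.
def fake_caption(prompt: str) -> str:
    return f"Detailed: {prompt}"


def fake_image(caption: str) -> Tuple[Any, Any]:
    image = {"caption": caption}  # placeholder for a PIL.Image
    return image, image


def fake_video(caption: str, image: Any) -> Tuple[str, str]:
    return "demo.mp4", "demo.gif"


if __name__ == "__main__":
    print(run_pipeline("a cat", fake_caption, fake_image, fake_video))
```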
Utility Functions
save_video(tensor)
Exports video frames to an MP4 file at 8 FPS, using a timestamp-based filename.
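A timestamp-based filename can be built as in the sketch below. The helper name `timestamped_video_path`, the `output_dir` argument, and the exact `strftime` pattern are assumptions for illustration; the script's own naming scheme may differ.

```python
import os
from datetime import datetime
from typing import Optional


def timestamped_video_path(output_dir: str, now: Optional[datetime] = None) -> str:
    """Build an MP4 path whose file name is the given (or current) time to the second."""
    moment = now if now is not None else datetime.now()
    return os.path.join(output_dir, moment.strftime("%Y%m%d_%H%M%S") + ".mp4")


if __name__ == "__main__":
    print(timestamped_video_path("output"))
```

With one-second resolution, two videos saved within the same second would map to the same path, so a pattern like this suits a single-user demo better than a busy service.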
convert_to_gif(video_path)
Converts an MP4 video to a resized GIF (240px height) at 8 FPS using moviepy.
delete_old_files()
Background function, run in a daemon thread, that removes output and temporary files older than 10 minutes. It repeats the sweep every 600 seconds.
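The cleanup loop can be approximated with the standard library alone. This is a sketch under stated assumptions, not the script's code: `remove_stale_files`, `start_cleanup_thread`, and the idea of passing the directories as an argument are hypothetical, while the 10-minute threshold and 600-second interval come from the description above.

```python
import os
import threading
import time
from typing import Iterable, List, Optional

MAX_AGE_SECONDS = 10 * 60  # files older than 10 minutes are removed
SWEEP_INTERVAL = 600       # seconds between sweeps


def remove_stale_files(
    directories: Iterable[str],
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> List[str]:
    """Delete regular files whose mtime is older than max_age; return the removed paths."""
    current = time.time() if now is None else now
    removed = []
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # skip directories that do not exist yet
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and current - os.path.getmtime(path) > max_age:
                os.remove(path)
                removed.append(path)
    return removed


def start_cleanup_thread(directories: List[str]) -> threading.Thread:
    """Sweep in a daemon thread so the loop never blocks interpreter shutdown."""
    def loop() -> None:
        while True:
            remove_stale_files(directories)
            time.sleep(SWEEP_INTERVAL)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Because the sweep interval equals the age threshold, a file in this sketch can survive for up to roughly 20 minutes: it may be just under 10 minutes old at one sweep and is only removed at the next.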
Configuration
| Parameter | Value | Description |
|---|---|---|
| Caption model | Llama-3.1-8B-Instruct | Local LLM for prompt refinement |
| Image model | FLUX.1-dev | Diffusion model for 480x720 image generation |
| Video model | CogVideoX-5B-I2V | Image-to-video generation, 49 frames |
| Image steps | 30 | Inference steps for FLUX |
| Video steps | 50 | Inference steps for CogVideoX |
| Guidance scale (video) | 6 | Classifier-free guidance for video generation |
| Seed | 1337 | Fixed seed for video generator reproducibility |
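The settings in this table, together with the values given in the Description, can be gathered into one immutable object. The `PipelineConfig` class and its field names are hypothetical, and reading 480x720 as height 480 by width 720 is an assumption; the script itself may simply use module-level constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    caption_model: str = "Llama-3.1-8B-Instruct"
    image_model: str = "FLUX.1-dev"
    video_model: str = "CogVideoX-5B-I2V"
    height: int = 480                  # assumed: 480x720 means height x width
    width: int = 720
    image_steps: int = 30
    image_guidance_scale: float = 3.5
    video_steps: int = 50
    video_guidance_scale: float = 6.0
    num_frames: int = 49
    fps: int = 8                       # export rate used for MP4 and GIF
    seed: int = 1337

    @property
    def clip_seconds(self) -> float:
        """Playback length implied by the frame count and export rate."""
        return self.num_frames / self.fps


if __name__ == "__main__":
    config = PipelineConfig()
    print(config.clip_seconds)  # 49 frames at 8 FPS
```

At 49 frames and 8 FPS, each generated clip plays for a little over six seconds.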
Usage Examples
# Launch the Gradio interface
# python tools/llm_flux_cogvideox/gradio_page.py
# The web UI exposes three buttons:
# 1. "Generate Caption" - refines user prompt via LLM
# 2. "Generate Image" - creates image from caption via FLUX
# 3. "Generate Video from Image" - creates video from image+caption via CogVideoX