Implementation:Zai org CogVideo LLM Flux Gradio

Knowledge Sources	Zai_org_CogVideo
Domains	Video_Generation, Web_Interface, Pipeline_Orchestration
Last Updated	2026-02-10 00:00 GMT

Overview

LLM Flux Gradio is a Gradio-based web application that provides an interactive three-step pipeline for generating videos from text prompts by chaining an LLM caption generator, FLUX image generator, and CogVideoX video generator.

Description

This module implements a web-based user interface using Gradio that orchestrates three generative models in sequence:

Caption Generation: A Llama-3.1-8B-Instruct language model refines user prompts into detailed video descriptions. The LLM is guided by a system prompt that instructs it to produce single, descriptive video generation prompts with a randomly selected word limit (25, 50, 75, or 100 words).

Image Generation: The FLUX.1-dev diffusion model generates a 480x720 image from the refined caption with 30 inference steps and a guidance scale of 3.5.

Video Generation: The CogVideoX-5B-I2V (image-to-video) pipeline generates a 49-frame video at 480x720 resolution from the image and caption, using 50 inference steps, guidance scale 6, and dynamic CFG.

All three models are loaded at startup with device_map="balanced" for multi-GPU distribution and torch.bfloat16 precision. The CogVideoX VAE uses both slicing and tiling to reduce memory usage. A background daemon thread automatically cleans up output files older than 10 minutes, and generated videos are also converted to GIF format via moviepy.

Usage

Use this Gradio application for interactive experimentation with the full text-to-video pipeline. It is designed for prototyping and demonstration purposes, providing a step-by-step visual workflow where users can review and adjust the caption and image before generating the final video.

Code Reference

Source Location

Repository: Zai_org_CogVideo
File: tools/llm_flux_cogvideox/gradio_page.py

Entry Point

if __name__ == "__main__":
    demo.launch()

Import

# This is a standalone script; run directly:
# python tools/llm_flux_cogvideox/gradio_page.py

I/O Contract

Pipeline Functions

`generate_caption(prompt)`

Name	Type	Required	Description
`prompt`	`str`	Yes	User's text prompt describing the desired video.

Returns a str containing the LLM-refined detailed caption.

`generate_image(caption)`

Name	Type	Required	Description
`caption`	`str`	Yes	Refined caption from the LLM.

Returns a tuple of (PIL.Image, PIL.Image): one for display, one stored in Gradio state.

`generate_video(caption, image)`

Name	Type	Required	Description
`caption`	`str`	Yes	Caption used for video generation conditioning.
`image`	`PIL.Image`	Yes	Source image for image-to-video generation.

Returns a tuple of (str, str): the video file path and GIF file path.

Utility Functions

`save_video(tensor)`

Exports video frames to an MP4 file with a timestamp-based filename at 8 FPS.

`convert_to_gif(video_path)`

Converts an MP4 video to a resized GIF (240px height) at 8 FPS using moviepy.

`delete_old_files()`

Background daemon function that removes output and temporary files older than 10 minutes, running every 600 seconds.

Configuration

Parameter	Value	Description
Caption model	Llama-3.1-8B-Instruct	Local LLM for prompt refinement
Image model	FLUX.1-dev	Diffusion model for 480x720 image generation
Video model	CogVideoX-5B-I2V	Image-to-video generation, 49 frames
Image steps	30	Inference steps for FLUX
Video steps	50	Inference steps for CogVideoX
Guidance scale (video)	6	Classifier-free guidance for video generation
Seed	1337	Fixed seed for video generator reproducibility

Usage Examples

# Launch the Gradio interface
# python tools/llm_flux_cogvideox/gradio_page.py

# The web UI exposes three buttons:
# 1. "Generate Caption" - refines user prompt via LLM
# 2. "Generate Image" - creates image from caption via FLUX
# 3. "Generate Video from Image" - creates video from image+caption via CogVideoX

Related Pages

Principle:Zai_org_CogVideo_LLM_Image_Video_Pipeline

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment