Implementation:Zai org CogVideo LLM Flux Gradio

Knowledge Sources
Domains: Video_Generation, Web_Interface, Pipeline_Orchestration
Last Updated: 2026-02-10 00:00 GMT

Overview

LLM Flux Gradio is a Gradio-based web application that provides an interactive three-step pipeline for generating videos from text prompts. It chains an LLM caption generator, the FLUX image generator, and the CogVideoX video generator.

Description

This module implements a web-based user interface using Gradio that orchestrates three generative models in sequence:

  1. Caption Generation: A Llama-3.1-8B-Instruct language model refines the user's prompt into a detailed video description. A system prompt instructs the LLM to produce a single descriptive video-generation prompt within a randomly selected word limit (25, 50, 75, or 100 words).
  2. Image Generation: The FLUX.1-dev diffusion model generates a 480x720 image from the refined caption, using 30 inference steps and a guidance scale of 3.5.
  3. Video Generation: The CogVideoX-5B-I2V (image-to-video) pipeline generates a 49-frame video at 480x720 resolution from the image and caption, using 50 inference steps, a guidance scale of 6, and dynamic CFG.
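
The random word limit in the caption step can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper name `build_system_prompt` and the prompt wording are assumptions, and only the four limits (25, 50, 75, 100) come from the description above.

```python
import random

# Word limits named in the description above.
WORD_LIMITS = (25, 50, 75, 100)


def build_system_prompt(rng=random):
    """Return (system_prompt, limit) with a randomly chosen word limit.

    Hypothetical helper: the wording below is illustrative and is not
    the system prompt used by the actual script.
    """
    limit = rng.choice(WORD_LIMITS)
    prompt = (
        "You write a single, descriptive prompt for video generation. "
        f"Keep the description to at most {limit} words."
    )
    return prompt, limit


if __name__ == "__main__":
    # A seeded generator makes the choice repeatable for a demo run.
    text, limit = build_system_prompt(random.Random(0))
    print(limit)
    print(text)
```

Passing a seeded `random.Random` instance is only a convenience for repeatable demos; the description above does not say whether the real script seeds this choice.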

All three models are loaded at startup with device_map="balanced" for multi-GPU distribution and torch.bfloat16 precision. The CogVideoX VAE uses both slicing and tiling to reduce memory usage. A background daemon thread automatically cleans up output files older than 10 minutes, and generated videos are also converted to GIF format via moviepy.
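
A background cleanup of this kind might look like the sketch below. It uses only the standard library, and the function names and parameters are assumptions for illustration; the script's own `delete_old_files()` may be structured differently. The 10-minute age threshold and 600-second interval are taken from this page.

```python
import os
import threading
import time

MAX_AGE_SECONDS = 10 * 60      # files older than 10 minutes are removed
SWEEP_INTERVAL_SECONDS = 600   # pause between sweeps


def remove_stale_files(directories, max_age=MAX_AGE_SECONDS, now=None):
    """Delete regular files whose mtime is older than max_age seconds.

    Returns the list of removed paths. Missing directories are skipped.
    """
    now = time.time() if now is None else now
    removed = []
    for directory in directories:
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
                os.remove(path)
                removed.append(path)
    return removed


def start_cleanup_thread(directories, interval=SWEEP_INTERVAL_SECONDS):
    """Sweep the directories repeatedly in a daemon thread."""
    def loop():
        while True:
            remove_stale_files(directories)
            time.sleep(interval)

    # daemon=True lets the interpreter exit without joining this thread.
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

A caller would start it once at import time, for example `start_cleanup_thread(["./output", "./gradio_tmp"])`, where both directory names are hypothetical.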

Usage

Use this Gradio application for interactive experimentation with the full text-to-video pipeline. It is designed for prototyping and demonstration purposes, providing a step-by-step visual workflow where users can review and adjust the caption and image before generating the final video.

Code Reference

Source Location

Entry Point

if __name__ == "__main__":
    demo.launch()

Import

# This is a standalone script; run directly:
# python tools/llm_flux_cogvideox/gradio_page.py

I/O Contract

Pipeline Functions

generate_caption(prompt)

Name     Type   Required   Description
prompt   str    Yes        User's text prompt describing the desired video.

Returns a str containing the LLM-refined detailed caption.

generate_image(caption)

Name      Type   Required   Description
caption   str    Yes        Refined caption from the LLM.

Returns a tuple of (PIL.Image, PIL.Image): one for display, one stored in Gradio state.

generate_video(caption, image)

Name      Type        Required   Description
caption   str         Yes        Caption used for video generation conditioning.
image     PIL.Image   Yes        Source image for image-to-video generation.

Returns a tuple of (str, str): the video file path and GIF file path.
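
The data flow between the three pipeline functions can be sketched with stand-in callables. In the real application each step is triggered by its own button, so `run_pipeline` and the `fake_*` functions below are hypothetical and only mirror the argument and return shapes listed above; no models are loaded.

```python
def run_pipeline(prompt, generate_caption, generate_image, generate_video):
    """Chain the three steps, passing each output to the next one."""
    caption = generate_caption(prompt)
    # generate_image returns two images: one to display, one kept as state.
    display_image, state_image = generate_image(caption)
    video_path, gif_path = generate_video(caption, state_image)
    return {
        "caption": caption,
        "image": display_image,
        "video_path": video_path,
        "gif_path": gif_path,
    }


# Hypothetical stand-ins that mimic the I/O contract without any models.
def fake_caption(prompt):
    return f"A detailed scene of {prompt}"


def fake_image(caption):
    # Assumption: the same object is returned for display and for state.
    image = {"caption": caption}
    return image, image


def fake_video(caption, image):
    return "output/example.mp4", "output/example.gif"


if __name__ == "__main__":
    result = run_pipeline("a red fox", fake_caption, fake_image, fake_video)
    print(result["caption"])
    print(result["video_path"], result["gif_path"])
```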

Utility Functions

save_video(tensor)

Exports the video frames to an MP4 file at 8 FPS, using a timestamp-based filename.

convert_to_gif(video_path)

Converts an MP4 video to a resized GIF (240px height) at 8 FPS using moviepy.
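
The figures on this page imply a clip length and a GIF size that can be checked with a little arithmetic. The helpers below are illustrative only and are not part of the script; the sketch also assumes that "480x720" means 480 pixels high by 720 pixels wide and that the GIF resize preserves aspect ratio.

```python
def clip_duration_seconds(num_frames, fps):
    """Playback length of a clip with num_frames frames at fps frames per second."""
    return num_frames / fps


def resize_to_height(width, height, target_height):
    """Return (width, height) scaled to target_height, keeping aspect ratio."""
    scale = target_height / height
    return round(width * scale), target_height


if __name__ == "__main__":
    # 49 frames exported at 8 FPS.
    print(clip_duration_seconds(49, 8))      # 6.125
    # Assumed 720 wide x 480 high source, resized to a 240px-high GIF.
    print(resize_to_height(720, 480, 240))   # (360, 240)
```

Under these assumptions, each generated clip plays for just over six seconds and the GIF is 360x240 pixels.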

delete_old_files()

Background daemon function that removes output and temporary files older than 10 minutes, running every 600 seconds.

Configuration

Parameter                Value                   Description
Caption model            Llama-3.1-8B-Instruct   Local LLM for prompt refinement
Image model              FLUX.1-dev              Diffusion model for 480x720 image generation
Video model              CogVideoX-5B-I2V        Image-to-video generation, 49 frames
Image steps              30                      Inference steps for FLUX
Video steps              50                      Inference steps for CogVideoX
Guidance scale (video)   6                       Classifier-free guidance for video generation
Seed                     1337                    Fixed seed for video generator reproducibility

Usage Examples

# Launch the Gradio interface
# python tools/llm_flux_cogvideox/gradio_page.py

# The web UI exposes three buttons:
# 1. "Generate Caption" - refines user prompt via LLM
# 2. "Generate Image" - creates image from caption via FLUX
# 3. "Generate Video from Image" - creates video from image+caption via CogVideoX
