Implementation:Sgl project Sglang LLaVA Video Pipeline

Knowledge Sources	Sgl_project_Sglang
Domains	Multimodal, Video Understanding
Last Updated	2026-02-10 00:00 GMT

Overview

Demonstrates video question answering with the LLaVA-Video model in SGLang, covering both single-video inference and batch processing split into chunks for distributed runs.

Description

srt_example_llava_v.py uses the sgl.video() multimodal primitive to embed video frames in a prompt and sgl.gen() to generate the response. The core video_qa function, decorated with @sgl.function, takes a frame count, a video path, and a question. It builds a user message containing the video frames followed by the question, then generates the assistant response.

For batch processing, the list of video files is split into a configurable number of chunks (split_into_chunks) so that each node in a multi-node run can process one chunk. Results are saved incrementally to per-batch CSV files via save_batch_results and merged into a final file per chunk via compile_and_cleanup_final_results. The script also handles video file discovery from directories, supporting the .mp4, .avi, and .mov formats.
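
The chunking and discovery steps described above can be sketched as follows. This is a minimal illustration and not the script's actual code: the function bodies are assumptions written to match the described behavior, and spreading leftover items over the first chunks is a design choice of this sketch.

```python
import os

VIDEO_EXTENSIONS = (".mp4", ".avi", ".mov")


def split_into_chunks(lst, num_chunks):
    """Split lst into exactly num_chunks contiguous chunks (some may be empty)."""
    base, extra = divmod(len(lst), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks receive one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(lst[start:start + size])
        start += size
    return chunks


def find_video_files(video_dir):
    """Return video paths under video_dir, or [video_dir] if it is a single file."""
    if os.path.isfile(video_dir):
        return [video_dir]
    found = []
    for root, _dirs, files in os.walk(video_dir):
        for name in files:
            if name.lower().endswith(VIDEO_EXTENSIONS):
                found.append(os.path.join(root, name))
    return sorted(found)
```

Each node would then select its own slice with something like split_into_chunks(find_video_files(video_dir), num_chunks)[cur_chunk].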

The main block configures the LLaVA-Video model with custom json_model_override_args including architecture type (LlavaVidForCausalLM), spatial pooling stride (mm_spatial_pool_stride), and optional RoPE scaling for 32-frame processing. It downloads a sample video from GitHub for testing and supports both 7B and 34B model variants with appropriate tokenizer paths.
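
To make the override step concrete, the sketch below assembles such a JSON string. It is an illustration under stated assumptions: the helper name build_model_override_args is hypothetical, the "architectures" and "rope_scaling" keys follow the usual Hugging Face config layout, and the scaling values are placeholders rather than values confirmed by this page.

```python
import json


def build_model_override_args(mm_spatial_pool_stride=2, num_frames=16):
    """Assemble a JSON string of model config overrides (illustrative values)."""
    overrides = {
        "architectures": ["LlavaVidForCausalLM"],
        "mm_spatial_pool_stride": mm_spatial_pool_stride,
    }
    if num_frames == 32:
        # Assumed values: linear RoPE scaling to accommodate the longer
        # token sequence produced by 32 frames.
        overrides["rope_scaling"] = {"factor": 2.0, "rope_type": "linear"}
    return json.dumps(overrides)


print(build_model_override_args(num_frames=32))
```

The resulting string is the kind of value that would be supplied as json_model_override_args when the model is loaded.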

Usage

Use this example for video understanding tasks including video description, question-answering, and batch video analysis. It demonstrates multi-node distributed video processing with SGLang's multimodal capabilities.

Code Reference

Source Location

Repository: Sgl_project_Sglang
File: examples/frontend_language/usage/llava_video/srt_example_llava_v.py
Lines: 1-260

Signature

@sgl.function
def video_qa(s, num_frames, video_path, question): ...

def single(path, num_frames=16): ...
def split_into_chunks(lst, num_chunks): ...
def save_batch_results(batch_video_files, states, cur_chunk, batch_idx, save_dir): ...
def compile_and_cleanup_final_results(cur_chunk, num_batches, save_dir): ...
def find_video_files(video_dir): ...
def batch(video_dir, save_dir, cur_chunk, num_chunks, num_frames=16, batch_size=64): ...

Import

import argparse
import csv
import json
import os
import time

import requests

import sglang as sgl

I/O Contract

Inputs

Name	Type	Required	Description
--port	int	No (default: 30000)	The master port for distributed serving
--chunk-idx	int	No (default: 0)	The index of the chunk to process
--num-chunks	int	No (default: 8)	The number of chunks for distributed processing
--save-dir	str	No (default: "./work_dirs/llava_video")	Directory to save processed results
--video-dir	str	No (default: "~/.cache/jobs.mp4")	Directory or path to video files
--model-path	str	No (default: "lmms-lab/LLaVA-NeXT-Video-7B")	Model path for video processing
--num-frames	int	No (default: 16)	Number of frames to extract from each video
--mm_spatial_pool_stride	int	No (default: 2)	Spatial pooling stride for the vision module
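
The options in the table above could be declared with argparse roughly as shown here. This is a sketch reconstructed from the table, not a copy of the script: the build_parser helper is hypothetical, and the "~" in the --video-dir default is left unexpanded (apply os.path.expanduser before use).

```python
import argparse


def build_parser():
    """Declare the CLI options listed in the Inputs table (sketch)."""
    parser = argparse.ArgumentParser(description="LLaVA-Video example options")
    parser.add_argument("--port", type=int, default=30000)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--num-chunks", type=int, default=8)
    parser.add_argument("--save-dir", type=str, default="./work_dirs/llava_video")
    parser.add_argument("--video-dir", type=str, default="~/.cache/jobs.mp4")
    parser.add_argument("--model-path", type=str, default="lmms-lab/LLaVA-NeXT-Video-7B")
    parser.add_argument("--num-frames", type=int, default=16)
    parser.add_argument("--mm_spatial_pool_stride", type=int, default=2)
    return parser


# Example: process chunk 1 with 32 frames per video, other options at defaults.
args = build_parser().parse_args(["--chunk-idx", "1", "--num-frames", "32"])
print(args.chunk_idx, args.num_frames, args.model_path)
```

Note that argparse maps hyphenated flags to underscored attributes, so --chunk-idx is read as args.chunk_idx.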

Outputs

Name	Type	Description
Console output	str	Video descriptions printed to standard output
CSV files	file	Batch results saved as CSV with video_name and answer columns
Final CSV	file	Compiled final results per chunk (final_results_chunk_{idx}.csv)
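
The incremental save and final merge can be sketched as below. This is a simplified illustration, not the script's implementation: it takes plain answer strings instead of SGLang state objects, and the per-batch file name chunk_{idx}_batch_{n}.csv is an assumption. Only the final file name comes from the table above.

```python
import csv
import os

HEADER = ["video_name", "answer"]


def save_batch_results(batch_video_files, answers, cur_chunk, batch_idx, save_dir):
    """Write one CSV per batch with video_name and answer columns."""
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"chunk_{cur_chunk}_batch_{batch_idx}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for video_path, answer in zip(batch_video_files, answers):
            writer.writerow([os.path.basename(video_path), answer])
    return path


def compile_and_cleanup_final_results(cur_chunk, num_batches, save_dir):
    """Merge the per-batch CSVs of one chunk into a final CSV and delete the parts."""
    final_path = os.path.join(save_dir, f"final_results_chunk_{cur_chunk}.csv")
    with open(final_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(HEADER)
        for batch_idx in range(num_batches):
            part = os.path.join(save_dir, f"chunk_{cur_chunk}_batch_{batch_idx}.csv")
            if not os.path.exists(part):
                continue  # tolerate batches that were never written
            with open(part, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                next(reader, None)  # skip the per-batch header row
                writer.writerows(reader)
            os.remove(part)
    return final_path
```

Writing one file per batch means an interrupted run loses at most the batch in flight, and the merge step leaves a single file per chunk.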

Usage Examples

# Install dependency
# pip install opencv-python-headless

# Run single video processing
# python3 srt_example_llava_v.py

# Run with custom model and frame count
# python3 srt_example_llava_v.py --model-path lmms-lab/LLaVA-NeXT-Video-7B --num-frames 32

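# Programmatic usage of the video_qa function.
# Assumption: an SGLang server hosting a LLaVA-Video model is already
# running at http://localhost:30000 (adjust the URL to your deployment).
import sglang as sgl

# video_qa.run() needs a default backend to send requests to.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def video_qa(s, num_frames, video_path, question):
    # User turn: sampled video frames followed by the question text.
    s += sgl.user(sgl.video(video_path, num_frames) + question)
    # Assistant turn: the generated text is stored under the key "answer".
    s += sgl.assistant(sgl.gen("answer"))

state = video_qa.run(
    num_frames=16,
    video_path="path/to/video.mp4",
    question="Describe the video.",
    temperature=0.0,
    max_new_tokens=1024,
)
print(state["answer"])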

Related Pages

Environment:Sgl_project_Sglang_Multimodal
