Implementation: FlagOpen FlagEmbedding VideoLLaVA Open Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Open-ended Evaluation, MLVU Benchmark |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for the Video-LLaVA model on open-ended video understanding tasks from the MLVU benchmark.
Description
This implementation provides an evaluation pipeline for the Video-LLaVA-7B model on the MLVU benchmark's open-ended tasks, specifically subplot identification and video summarization. The script uses the LanguageBind Video-LLaVA architecture with 4-bit quantization to generate free-form textual responses about video content. Unlike the choice-based evaluation, this version lets the model generate detailed descriptions and summaries without being constrained to predefined options. The implementation processes videos through the Video-LLaVA processor, tokenizes the prompt with image tokens, and generates extended responses (up to 1024 tokens) for the subplot and summary tasks.
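The image-token prompt construction described above can be sketched as a plain string operation. This is a minimal, model-free illustration; the 8-frame count matches Video-LLaVA-7B's default frame sampling, and the question text is made up for the example:

```python
DEFAULT_IMAGE_TOKEN = "<image>"  # placeholder later mapped to IMAGE_TOKEN_INDEX
num_frames = 8  # Video-LLaVA samples 8 frames per video by default

question = "Please summarize the main content of this video."
# One <image> placeholder per sampled frame, then the question on a new line
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * num_frames) + '\n' + question
```

At tokenization time each `<image>` placeholder is replaced by the special image token index, so the number of placeholders must match the number of frames the video tower produces.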
Usage
Use this script to evaluate Video-LLaVA models on MLVU benchmark tasks requiring descriptive text generation, particularly for understanding sub-scenes and creating video summaries.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/models/videollava/open_bench.py
- Lines: 1-200
Key Components
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        # Dataset for open-ended Video-LLaVA evaluation
        ...

    def qa_template(self, data):
        # Simple pass-through template for open-ended questions
        question = f"{data['question']}"
        answer = data['answer']
        return question, answer

# Main evaluation loop
for example in dataset:
    video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)  # open-ended response
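The dataset class above can be fleshed out as a plain-Python sketch, without the torch dependency. The per-record fields (`question`, `answer`, `video`) and the flattening of tasks into one list are assumptions inferred from the `data_list` layout shown later, not a verbatim copy of the source:

```python
import json
import os
import tempfile

class MLVU:
    """Plain-Python stand-in for the torch Dataset: flattens all task JSONs into one list."""
    def __init__(self, data_dir, data_list):
        self.data = []
        for task_name, (json_file, video_dir, _media_type) in data_list.items():
            with open(os.path.join(data_dir, json_file)) as f:
                for record in json.load(f):
                    self.data.append({"task_type": task_name,
                                      "video_dir": video_dir,
                                      "data": record})

    def __len__(self):
        return len(self.data)

    def qa_template(self, data):
        # Open-ended: pass the question through unchanged, no answer options appended
        return f"{data['question']}", data['answer']

    def __getitem__(self, idx):
        item = self.data[idx]
        question, answer = self.qa_template(item["data"])
        return {"task_type": item["task_type"],
                "video_path": os.path.join(item["video_dir"], item["data"]["video"]),
                "question": question,
                "answer": answer}

# Demo with a throwaway task file (record field names are assumptions)
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "9_summary.json"), "w") as f:
    json.dump([{"question": "Summarize the video.", "answer": "gt", "video": "a.mp4"}], f)
ds = MLVU(tmp, {"summary": ("9_summary.json", "/videos/summary", "video")})
```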
Imports
# Video-LLaVA components
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
# Standard libraries
import torch
import json
from tqdm import tqdm
import os
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory with subplot and summary JSON files |
| data_list | dict | Yes | Mapping of subPlot and summary task configurations |
| model_path | str | Yes | Path to Video-LLaVA model (LanguageBind/Video-LLaVA-7B) |
| device | str | Yes | CUDA device (e.g., 'cuda:6') |
| load_4bit | bool | No | Enable 4-bit quantization (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| subplot_all.json | JSON file | Subplot predictions with video names, questions, answers, and predictions |
| summary_all.json | JSON file | Summary predictions with video names, questions, answers, and predictions |
| Console output | text | Ground truth and predictions during inference |
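The two output files can be sketched as follows. The exact record keys are an assumption based on the description above (video name, question, ground-truth answer, prediction); only the filenames `subplot_all.json` and `summary_all.json` come from the source:

```python
import json
import os
import tempfile

# One list of result records per task (key names are assumptions)
res_list = {"subPlot": [], "summary": []}

def record_result(task_type, video_name, question, answer, pred):
    res_list[task_type].append({"video_name": video_name,
                                "question": question,
                                "answer": answer,
                                "pred": pred})

record_result("subPlot", "clip_001.mp4",
              "What happens in this sub-scene?", "ground truth", "model output")

# One JSON file per task, matching subplot_all.json / summary_all.json
out_dir = tempfile.mkdtemp()
for task_type, filename in [("subPlot", "subplot_all.json"),
                            ("summary", "summary_all.json")]:
    with open(os.path.join(out_dir, filename), "w") as f:
        json.dump(res_list[task_type], f, indent=2)
```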
Usage Examples
# Initialize Video-LLaVA model
disable_torch_init()
data_list = {
    "subPlot": ("8_sub_scene.json", "/LVBench_all/video/subPlot", "video"),
    "summary": ("9_summary.json", "/LVBench_all/video/summary", "video")
}
model_path = 'LanguageBind/Video-LLaVA-7B'
device = 'cuda:6'
load_4bit = True
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path),
    False, load_4bit, device=device
)
video_processor = processor['video']
# Process video for open-ended generation
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()
video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
tensor = video_tensor.to(model.device, dtype=torch.float16)
# Prepare input with image tokens
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + question
conv.system = "Carefully watch this video and pay attention to every detail. Based on your observations, answer the given questions."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None) # Open-ended
# Tokenize the prompt (each <image> placeholder becomes IMAGE_TOKEN_INDEX)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
).unsqueeze(0).to(model.device)
# Generate response
output_ids = model.generate(
    input_ids,
    images=tensor,
    do_sample=True,
    temperature=0.1,
    max_new_tokens=1024
)
# Decode only the newly generated tokens and strip the EOS marker
pred = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip().replace("</s>", "")
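The post-processing in the final line can be isolated as a small pure function for clarity (the helper name is ours, not from the source):

```python
def clean_prediction(decoded: str) -> str:
    # Mirrors the script's cleanup: trim whitespace, then drop the </s> EOS marker
    return decoded.strip().replace("</s>", "")

cleaned = clean_prediction("  The subplot shows a chase scene.</s>")
```

Keeping this step separate from decoding makes it easy to unit-test and to extend if the model emits additional special tokens.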