Implementation: FlagOpen FlagEmbedding VideoLLaVA Open Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Open-ended Evaluation, MLVU Benchmark |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for the Video-LLaVA model on open-ended video understanding tasks from the MLVU benchmark.
Description
This implementation provides an evaluation pipeline for the Video-LLaVA-7B model on the MLVU benchmark's open-ended tasks, specifically subplot identification and video summarization. The script uses the LanguageBind Video-LLaVA architecture with 4-bit quantization to generate free-form textual responses about video content. Unlike the choice-based evaluation, this version lets the model generate detailed descriptions and summaries without being constrained to predefined options. The implementation processes videos through the Video-LLaVA processor, tokenizes the prompt with image tokens, and generates extended responses (up to 1024 tokens) for the subplot and summary tasks.
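The image-token prompt construction described above can be sketched as a plain string operation. This is a minimal, model-free illustration; the 8-frame count matches Video-LLaVA-7B's default frame sampling, and the question text is made up for the example:

```python
DEFAULT_IMAGE_TOKEN = "<image>"  # placeholder later mapped to IMAGE_TOKEN_INDEX
num_frames = 8  # Video-LLaVA samples 8 frames per video by default

question = "Please summarize the main content of this video."
# One <image> placeholder per sampled frame, then the question on a new line
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * num_frames) + '\n' + question
```

At tokenization time each `<image>` placeholder is replaced by the special image token index, so the number of placeholders must match the number of frames the video tower produces.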
Usage
Use this script to evaluate Video-LLaVA models on MLVU benchmark tasks requiring descriptive text generation, particularly for understanding sub-scenes and creating video summaries.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/models/videollava/open_bench.py
- Lines: 1-200
Key Components
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        # Dataset for open-ended Video-LLaVA evaluation
        ...

    def qa_template(self, data):
        # Simple pass-through template for open-ended questions
        question = f"{data['question']}"
        answer = data['answer']
        return question, answer

# Main evaluation loop
for example in dataset:
    video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)  # open-ended response
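The dataset class above can be fleshed out as a plain-Python sketch, without the torch dependency. The per-record fields (`question`, `answer`, `video`) and the flattening of tasks into one list are assumptions inferred from the `data_list` layout shown later, not a verbatim copy of the source:

```python
import json
import os
import tempfile

class MLVU:
    """Plain-Python stand-in for the torch Dataset: flattens all task JSONs into one list."""
    def __init__(self, data_dir, data_list):
        self.data = []
        for task_name, (json_file, video_dir, _media_type) in data_list.items():
            with open(os.path.join(data_dir, json_file)) as f:
                for record in json.load(f):
                    self.data.append({"task_type": task_name,
                                      "video_dir": video_dir,
                                      "data": record})

    def __len__(self):
        return len(self.data)

    def qa_template(self, data):
        # Open-ended: pass the question through unchanged, no answer options appended
        return f"{data['question']}", data['answer']

    def __getitem__(self, idx):
        item = self.data[idx]
        question, answer = self.qa_template(item["data"])
        return {"task_type": item["task_type"],
                "video_path": os.path.join(item["video_dir"], item["data"]["video"]),
                "question": question,
                "answer": answer}

# Demo with a throwaway task file (record field names are assumptions)
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "9_summary.json"), "w") as f:
    json.dump([{"question": "Summarize the video.", "answer": "gt", "video": "a.mp4"}], f)
ds = MLVU(tmp, {"summary": ("9_summary.json", "/videos/summary", "video")})
```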
Imports
# Video-LLaVA components
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
# Standard libraries
import torch
import json
from tqdm import tqdm
import os
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory with subplot and summary JSON files |
| data_list | dict | Yes | Mapping of subPlot and summary task configurations |
| model_path | str | Yes | Path to Video-LLaVA model (LanguageBind/Video-LLaVA-7B) |
| device | str | Yes | CUDA device (e.g., 'cuda:6') |
| load_4bit | bool | No | Enable 4-bit quantization (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| subplot_all.json | JSON file | Subplot predictions with video names, questions, answers, and predictions |
| summary_all.json | JSON file | Summary predictions with video names, questions, answers, and predictions |
| Console output | text | Ground truth and predictions during inference |
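The two output files can be sketched as follows. The exact record keys are an assumption based on the description above (video name, question, ground-truth answer, prediction); only the filenames `subplot_all.json` and `summary_all.json` come from the source:

```python
import json
import os
import tempfile

# One list of result records per task (key names are assumptions)
res_list = {"subPlot": [], "summary": []}

def record_result(task_type, video_name, question, answer, pred):
    res_list[task_type].append({"video_name": video_name,
                                "question": question,
                                "answer": answer,
                                "pred": pred})

record_result("subPlot", "clip_001.mp4",
              "What happens in this sub-scene?", "ground truth", "model output")

# One JSON file per task, matching subplot_all.json / summary_all.json
out_dir = tempfile.mkdtemp()
for task_type, filename in [("subPlot", "subplot_all.json"),
                            ("summary", "summary_all.json")]:
    with open(os.path.join(out_dir, filename), "w") as f:
        json.dump(res_list[task_type], f, indent=2)
```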
Usage Examples
# Initialize Video-LLaVA model
disable_torch_init()
data_list = {
    "subPlot": ("8_sub_scene.json", "/LVBench_all/video/subPlot", "video"),
    "summary": ("9_summary.json", "/LVBench_all/video/summary", "video")
}
model_path = 'LanguageBind/Video-LLaVA-7B'
device = 'cuda:6'
load_4bit = True
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path),
    False, load_4bit, device=device
)
video_processor = processor['video']
# Process video for open-ended generation
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()
video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
tensor = video_tensor.to(model.device, dtype=torch.float16)
# Prepare input with image tokens
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + question
conv.system = "Carefully watch this video and pay attention to every detail. Based on your observations, answer the given questions."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None) # Open-ended
# Tokenize the prompt (each <image> placeholder becomes IMAGE_TOKEN_INDEX)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
).unsqueeze(0).to(model.device)
# Generate response
output_ids = model.generate(
    input_ids,
    images=tensor,
    do_sample=True,
    temperature=0.1,
    max_new_tokens=1024
)
# Decode only the newly generated tokens and strip the EOS marker
pred = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip().replace("</s>", "")
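The post-processing in the final line can be isolated as a small pure function for clarity (the helper name is ours, not from the source):

```python
def clean_prediction(decoded: str) -> str:
    # Mirrors the script's cleanup: trim whitespace, then drop the </s> EOS marker
    return decoded.strip().replace("</s>", "")

cleaned = clean_prediction("  The subplot shows a chase scene.</s>")
```

Keeping this step separate from decoding makes it easy to unit-test and to extend if the model emits additional special tokens.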