Implementation:FlagOpen FlagEmbedding MLVU Open Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Benchmark, Machine Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Dataset loader and evaluation framework for open-ended generation tasks in the MLVU video understanding benchmark.
Description
This module provides the MLVU dataset class and evaluation infrastructure for open-ended video question answering tasks, specifically for sub-plot analysis and video summarization. It implements a PyTorch Dataset that loads video paths, questions, and ground truth answers from JSON files, preparing data for multimodal language model inference.
The dataset handles multiple task types (subPlot and summary) and provides utilities for formatting prompts with conversation templates. It includes frame index calculation for temporal video sampling and statistical reporting of dataset composition. The module's main function provides a template inference workflow for running a model over the dataset and collecting predictions.
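The frame index calculation for temporal sampling typically selects a fixed number of frames spread evenly across the video. A minimal sketch is below; `get_frame_indices` is a hypothetical helper name used for illustration, not the module's actual function:

```python
def get_frame_indices(num_frames, total_frames):
    """Uniformly sample `num_frames` indices across a video.

    Splits the video into `num_frames` equal segments and takes the
    midpoint frame of each, a common temporal-sampling scheme.
    """
    seg_size = total_frames / num_frames
    return [int(seg_size * i + seg_size / 2) for i in range(num_frames)]
```

The sampled indices are then used to decode only those frames before passing them to the multimodal model.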
Usage
Use this class to load MLVU benchmark data for evaluating multimodal language models on long video understanding tasks with open-ended generation. Integrate with your model's inference pipeline to generate predictions that can be evaluated using the companion evaluation scripts.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/generation_evaluation/open_bench.py
- Lines: 1-167
Signature
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        """Initialize MLVU dataset"""
    def __getitem__(self, idx):
        """Get a single data sample"""
    def qa_template(self, data):
        """Format question and answer"""

def get_prompt2(conv):
    """Generate prompt from conversation template"""
Import
import torch
from torch.utils.data import Dataset
import json
from tqdm import tqdm
import numpy as np
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing JSON annotation files |
| data_list | dict | Yes | Dictionary mapping task types to (json_file, video_prefix, data_type, flag) |
Outputs
| Name | Type | Description |
|---|---|---|
| video | str | Path to video file |
| question | str | Question text |
| answer | str | Ground truth answer |
| task_type | str | Type of task (subPlot or summary) |
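To make the output contract concrete, a hypothetical sketch of how a sample could be assembled from one annotation entry is shown below. The annotation field names (`"video"`, `"question"`, `"answer"`) are assumptions for illustration, not confirmed from the source file:

```python
def build_sample(entry, video_prefix, task_type):
    # Assemble one dataset sample matching the outputs table:
    # joins the video prefix with the per-entry filename and carries
    # the question, ground-truth answer, and task type through.
    return {
        "video": f"{video_prefix}/{entry['video']}",
        "question": entry["question"],
        "answer": entry["answer"],
        "task_type": task_type,
    }
```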
Usage Examples
# Dataset initialization
data_list = {
    "subPlot": ("8_sub_scene.json", "MLVU_all/video/subPlot", "video", False),
    "summary": ("9_summary.json", "MLVU_all/video/summary", "video", False)
}
data_dir = "MLVU_all/json"
dataset = MLVU(data_dir, data_list)

# Print dataset statistics
print(dataset)

# Iterate through dataset
for example in dataset:
    video_path = example["video"]
    question = example["question"]
    answer = example["answer"]
    task_type = example["task_type"]

    # Run your model inference here
    # pred = model.generate(video_path, question)

    # Collect results
    result = {
        "video_name": video_path.split("/")[-1],
        "Q": question,
        "A": answer,
        "pred": pred
    }
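After the loop, the collected results are typically serialized to JSON for the companion evaluation scripts. A minimal sketch, assuming the scripts consume a JSON list of result dicts (the output filename here is an assumption):

```python
import json

# Collected per-sample results; a single illustrative entry is shown.
results = [
    {
        "video_name": "clip_001.mp4",
        "Q": "What happens in the scene?",
        "A": "ground-truth answer",
        "pred": "model output",
    },
]

# Write predictions to disk for downstream evaluation.
with open("open_bench_predictions.json", "w") as f:
    json.dump(results, f, indent=4)
```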