Implementation: FlagOpen FlagEmbedding MLVU Choice Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Benchmark, Multiple Choice Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Dataset loader and evaluation framework for multiple-choice video question answering tasks in the MLVU benchmark.
Description
This module provides the MLVU dataset class for multiple-choice question answering on long videos, covering seven task types: count, ego, needle, order, plotQA, anomaly recognition, and topic reasoning. It implements a PyTorch Dataset that loads video paths with formatted multiple-choice questions (A, B, C, D options) and evaluates model predictions against ground truth answers.
The dataset formats questions with labeled options and includes an answer checking function that extracts the predicted option letter and compares it to the ground truth. The evaluation framework calculates per-task and overall accuracy, with support for random baseline computation. The main function provides a template for integrating with multimodal language model inference pipelines.
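As a rough illustration of the answer-checking logic described above, here is a minimal sketch of an option-letter comparison. This is a hypothetical reimplementation, not the actual `check_ans` from `choice_bench.py`; it assumes answers are formatted like "(B) ..." or a bare letter.

```python
def check_ans_sketch(pred: str, gt: str) -> bool:
    """Sketch: compare the predicted option letter to the ground truth letter.

    Hypothetical stand-in for check_ans; assumes answers look like
    "(B) some text" or just "B". The real function may differ.
    """
    # Strip whitespace and a leading "(" so "(B) Twice" and "B" both work.
    pred_letter = pred.strip().lstrip("(")[:1].upper()
    gt_letter = gt.strip().lstrip("(")[:1].upper()
    return bool(pred_letter) and gt_letter in "ABCD" and pred_letter == gt_letter

print(check_ans_sketch("(B) Twice", "(B) Twice"))  # True
print(check_ans_sketch("(C) Three", "(B) Twice"))  # False
```

Comparing only the leading letter keeps the check robust to whether the model echoes the full option text or just the letter.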
Usage
Use this class to evaluate multimodal language models on multiple-choice video understanding tasks in the MLVU benchmark, particularly for assessing long-form video comprehension across diverse question types.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/multiple_choice_evaluation/choice_bench.py
- Lines: 1-213
Signature
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        """Initialize MLVU dataset from annotation JSON files"""
    def qa_template(self, data):
        """Format multiple choice question with labeled options"""
    def __getitem__(self, idx):
        """Get video path, question, answer, and task type"""

def check_ans(pred, gt):
    """Check if prediction matches ground truth (module-level function)"""
Import
import torch
from torch.utils.data import Dataset
import json
from tqdm import tqdm
import numpy as np
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing JSON annotation files |
| data_list | dict | Yes | Dictionary mapping task types to (json_file, video_prefix, data_type) |
Outputs
| Name | Type | Description |
|---|---|---|
| video | str | Path to video file |
| question | str | Formatted multiple-choice question with options |
| answer | str | Ground truth answer with option letter |
| task_type | str | Type of task (count, ego, needle, order, plotQA, anomaly_reco, topic_reasoning) |
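To make the `question`/`answer` output format above concrete, here is a hedged sketch of `qa_template`-style formatting. The annotation field names (`"question"`, `"candidates"`, `"answer"`) are assumptions about the JSON schema and may not match the actual files.

```python
def qa_template_sketch(data: dict) -> tuple:
    """Sketch: format a question with (A)-(D) options and build the answer string.

    Assumes annotation entries with "question", "candidates", and "answer"
    fields; this is an illustrative guess at the schema, not the repo's code.
    """
    question = f"Question: {data['question']}\nOptions:\n"
    answer_letter = None
    for i, cand in enumerate(data["candidates"]):
        letter = chr(ord("A") + i)  # A, B, C, D for successive candidates
        question += f"({letter}) {cand}\n"
        if cand == data["answer"]:
            answer_letter = letter
    answer = f"({answer_letter}) {data['answer']}"
    return question, answer

q, a = qa_template_sketch({
    "question": "How many cats appear?",
    "candidates": ["One", "Two", "Three", "Four"],
    "answer": "Two",
})
print(a)  # (B) Two
```

The ground-truth answer string carries its option letter, which is what a `check_ans`-style comparison extracts.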
Usage Examples
# Dataset initialization
data_list = {
"count": ("4_count.json", "/MLVU_all/video/count", "video"),
"ego": ("3_ego.json", "/MLVU_all/video/ego", "video"),
"needle": ("2_needle.json", "/MLVU_all/video/needle", "video"),
"order": ("5_order.json", "/MLVU_all/video/order", "video"),
"plotQA": ("1_plotQA.json", "/MLVU_all/video/plotQA", "video"),
"anomaly_reco": ("6_anomaly_reco.json", "/MLVU_all/video/anomaly_reco", "video"),
"topic_reasoning": ("7_topic_reasoning.json", "/MLVU_all/video/topic_reasoning", "video")
}
data_dir = "/MLVU_all/upload_json"
dataset = MLVU(data_dir, data_list)
# Evaluate model
correct = 0
total = 0
acc_dict = {}  # task_type -> [correct, total]
for example in dataset:
    task_type = example['task_type']
    video_path = example["video"]
    question = example["question"]
    gt = example['answer']
    if task_type not in acc_dict:
        acc_dict[task_type] = [0, 0]
    acc_dict[task_type][1] += 1
    total += 1
    # Run model inference
    # pred = model.generate(video_path, question)
    # Check answer
    if check_ans(pred=pred, gt=gt):
        correct += 1
        acc_dict[task_type][0] += 1
# Calculate overall accuracy
print(f"Accuracy: {correct / total * 100:.2f}%")
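The per-task accuracy and random-baseline reporting mentioned in the description can be sketched as follows. This is a minimal illustration assuming `acc_dict` maps each task type to `[correct, total]` counts; the actual reporting code in the repository may differ.

```python
def report_accuracy(acc_dict: dict, num_options: int = 4) -> dict:
    """Sketch: compute per-task accuracy and a random-chance baseline.

    Assumes acc_dict maps task_type -> [correct, total]; with four options
    (A-D) the random baseline is 25%. Illustrative, not the repo's code.
    """
    results = {}
    for task, (correct, total) in acc_dict.items():
        results[task] = {
            "accuracy": 100.0 * correct / total,
            "random_baseline": 100.0 / num_options,  # 25.0 for 4 options
        }
    return results

print(report_accuracy({"count": [30, 100], "order": [55, 110]}))
```

Reporting the baseline alongside accuracy makes it easy to see which task types a model answers better than chance.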