Implementation:FlagOpen FlagEmbedding MLVU Choice Bench

From Leeroopedia


Knowledge Sources
Domains Video Understanding, Benchmark, Multiple Choice Evaluation
Last Updated 2026-02-09 00:00 GMT

Overview

Dataset loader and evaluation framework for multiple-choice video question answering tasks in the MLVU benchmark.

Description

This module provides the MLVU dataset class for multiple-choice question answering on long videos, covering seven task types: count, ego, needle, order, plotQA, anomaly recognition, and topic reasoning. It implements a PyTorch Dataset that loads video paths with formatted multiple-choice questions (A, B, C, D options) and evaluates model predictions against ground truth answers.

The dataset formats questions with labeled options and includes an answer checking function that extracts the predicted option letter and compares it to the ground truth. The evaluation framework calculates per-task and overall accuracy, with support for random baseline computation. The main function provides a template for integrating with multimodal language model inference pipelines.
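As a concrete illustration of the question formatting described above, here is a minimal sketch of what `qa_template` might do. The record field names (`question`, `candidates`, `answer`) and the `(A) …` option style are assumptions about the MLVU JSON schema, not confirmed from the source:

```python
def qa_template(data):
    """Format one annotation record into a lettered multiple-choice
    prompt and a ground-truth answer string.

    Sketch only: field names ('question', 'candidates', 'answer')
    are assumed, not taken from the upstream implementation.
    """
    question = f"Question: {data['question']}\n"
    question += "Options:\n"
    answer_idx = -1
    for idx, candidate in enumerate(data['candidates']):
        # Label options (A), (B), (C), (D) in order.
        question += f"({chr(ord('A') + idx)}) {candidate}\n"
        if candidate == data['answer']:
            answer_idx = idx
    # Ground truth carries its option letter so check_ans can compare letters.
    answer = f"({chr(ord('A') + answer_idx)}) {data['answer']}"
    return question.rstrip(), answer
```

The returned question string can be passed directly to a model prompt, while the answer string keeps the option letter for later comparison.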

Usage

Use this class to evaluate multimodal language models on multiple-choice video understanding tasks in the MLVU benchmark, particularly for assessing long-form video comprehension across diverse question types.

Code Reference

Source Location

Signature

class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        """Initialize MLVU dataset"""

    def qa_template(self, data):
        """Format multiple choice question with options"""

    def __getitem__(self, idx):
        """Get video path, question, and answer"""

def check_ans(pred, gt):
    """Check if prediction matches ground truth"""

Import

import torch
from torch.utils.data import Dataset
import json
from tqdm import tqdm
import numpy as np

I/O Contract

Inputs

Name Type Required Description
data_dir str Yes Directory containing JSON annotation files
data_list dict Yes Dictionary mapping task types to (json_file, video_prefix, data_type)
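Given these inputs, the loader presumably flattens the per-task JSON files into one list of records. A hedged sketch of that step (the record layout and the `load_annotations` helper name are assumptions for illustration):

```python
import json
import os

def load_annotations(data_dir, data_list):
    """Flatten per-task JSON annotation files into a single list.

    Sketch of what MLVU.__init__ likely does: for each task, read
    data_dir/json_file and tag every item with its task type, video
    prefix, and data type. The record layout is assumed.
    """
    data = []
    for task_type, (json_file, video_prefix, data_type) in data_list.items():
        with open(os.path.join(data_dir, json_file)) as f:
            for item in json.load(f):
                data.append({
                    "task_type": task_type,
                    "prefix": video_prefix,
                    "data_type": data_type,
                    "data": item,
                })
    return data
```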

Outputs

Name Type Description
video str Path to video file
question str Formatted multiple-choice question with options
answer str Ground truth answer with option letter
task_type str Type of task (count, ego, needle, order, plotQA, anomaly_reco, topic_reasoning)

Usage Examples

# Dataset initialization
data_list = {
    "count": ("4_count.json", "/MLVU_all/video/count", "video"),
    "ego": ("3_ego.json", "/MLVU_all/video/ego", "video"),
    "needle": ("2_needle.json", "/MLVU_all/video/needle", "video"),
    "order": ("5_order.json", "/MLVU_all/video/order", "video"),
    "plotQA": ("1_plotQA.json", "/MLVU_all/video/plotQA", "video"),
    "anomaly_reco": ("6_anomaly_reco.json", "/MLVU_all/video/anomaly_reco", "video"),
    "topic_reasoning": ("7_topic_reasoning.json", "/MLVU_all/video/topic_reasoning", "video")
}

data_dir = "/MLVU_all/upload_json"
dataset = MLVU(data_dir, data_list)

# Evaluate model
correct = 0
total = 0
acc_dict = {}  # task_type -> [correct, total]

for example in dataset:
    task_type = example['task_type']
    video_path = example["video"]
    question = example["question"]
    gt = example['answer']

    if task_type not in acc_dict:
        acc_dict[task_type] = [0, 0]
    acc_dict[task_type][1] += 1
    total += 1

    # Run inference with your multimodal LM (placeholder API)
    pred = model.generate(video_path, question)

    # Check answer
    if check_ans(pred=pred, gt=gt):
        correct += 1
        acc_dict[task_type][0] += 1

# Calculate accuracy
print(f"Accuracy: {correct / total * 100:.2f}%")
