Implementation: FlagOpen FlagEmbedding MLVU Choice Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Benchmark, Multiple Choice Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Dataset loader and evaluation framework for multiple-choice video question answering tasks in the MLVU benchmark.
Description
This module provides the MLVU dataset class for multiple-choice question answering on long videos, covering seven task types: count, ego, needle, order, plotQA, anomaly recognition, and topic reasoning. It implements a PyTorch Dataset that loads video paths with formatted multiple-choice questions (A, B, C, D options) and evaluates model predictions against ground truth answers.
The dataset formats questions with labeled options and includes an answer checking function that extracts the predicted option letter and compares it to the ground truth. The evaluation framework calculates per-task and overall accuracy, with support for random baseline computation. The main function provides a template for integrating with multimodal language model inference pipelines.
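As a rough illustration of the answer-checking logic described above, here is a minimal sketch of an option-letter comparison. This is a hypothetical reimplementation, not the actual `check_ans` from `choice_bench.py`; it assumes answers are formatted like "(B) ..." or a bare letter.

```python
def check_ans_sketch(pred: str, gt: str) -> bool:
    """Sketch: compare the predicted option letter to the ground truth letter.

    Hypothetical stand-in for check_ans; assumes answers look like
    "(B) some text" or just "B". The real function may differ.
    """
    # Strip whitespace and a leading "(" so "(B) Twice" and "B" both work.
    pred_letter = pred.strip().lstrip("(")[:1].upper()
    gt_letter = gt.strip().lstrip("(")[:1].upper()
    return bool(pred_letter) and gt_letter in "ABCD" and pred_letter == gt_letter

print(check_ans_sketch("(B) Twice", "(B) Twice"))  # True
print(check_ans_sketch("(C) Three", "(B) Twice"))  # False
```

Comparing only the leading letter keeps the check robust to whether the model echoes the full option text or just the letter.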
Usage
Use this class to evaluate multimodal language models on multiple-choice video understanding tasks in the MLVU benchmark, particularly for assessing long-form video comprehension across diverse question types.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/multiple_choice_evaluation/choice_bench.py
- Lines: 1-213
Signature
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        """Initialize MLVU dataset from annotation JSON files"""
    def qa_template(self, data):
        """Format multiple choice question with labeled options"""
    def __getitem__(self, idx):
        """Get video path, question, answer, and task type"""

def check_ans(pred, gt):
    """Check if prediction matches ground truth (module-level function)"""
Import
import torch
from torch.utils.data import Dataset
import json
from tqdm import tqdm
import numpy as np
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing JSON annotation files |
| data_list | dict | Yes | Dictionary mapping task types to (json_file, video_prefix, data_type) |
Outputs
| Name | Type | Description |
|---|---|---|
| video | str | Path to video file |
| question | str | Formatted multiple-choice question with options |
| answer | str | Ground truth answer with option letter |
| task_type | str | Type of task (count, ego, needle, order, plotQA, anomaly_reco, topic_reasoning) |
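To make the `question`/`answer` output format above concrete, here is a hedged sketch of `qa_template`-style formatting. The annotation field names (`"question"`, `"candidates"`, `"answer"`) are assumptions about the JSON schema and may not match the actual files.

```python
def qa_template_sketch(data: dict) -> tuple:
    """Sketch: format a question with (A)-(D) options and build the answer string.

    Assumes annotation entries with "question", "candidates", and "answer"
    fields; this is an illustrative guess at the schema, not the repo's code.
    """
    question = f"Question: {data['question']}\nOptions:\n"
    answer_letter = None
    for i, cand in enumerate(data["candidates"]):
        letter = chr(ord("A") + i)  # A, B, C, D for successive candidates
        question += f"({letter}) {cand}\n"
        if cand == data["answer"]:
            answer_letter = letter
    answer = f"({answer_letter}) {data['answer']}"
    return question, answer

q, a = qa_template_sketch({
    "question": "How many cats appear?",
    "candidates": ["One", "Two", "Three", "Four"],
    "answer": "Two",
})
print(a)  # (B) Two
```

The ground-truth answer string carries its option letter, which is what a `check_ans`-style comparison extracts.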
Usage Examples
# Dataset initialization
data_list = {
"count": ("4_count.json", "/MLVU_all/video/count", "video"),
"ego": ("3_ego.json", "/MLVU_all/video/ego", "video"),
"needle": ("2_needle.json", "/MLVU_all/video/needle", "video"),
"order": ("5_order.json", "/MLVU_all/video/order", "video"),
"plotQA": ("1_plotQA.json", "/MLVU_all/video/plotQA", "video"),
"anomaly_reco": ("6_anomaly_reco.json", "/MLVU_all/video/anomaly_reco", "video"),
"topic_reasoning": ("7_topic_reasoning.json", "/MLVU_all/video/topic_reasoning", "video")
}
data_dir = "/MLVU_all/upload_json"
dataset = MLVU(data_dir, data_list)
# Evaluate model
correct = 0
total = 0
acc_dict = {}  # task_type -> [correct, total]
for example in dataset:
    task_type = example['task_type']
    video_path = example["video"]
    question = example["question"]
    gt = example['answer']
    if task_type not in acc_dict:
        acc_dict[task_type] = [0, 0]
    acc_dict[task_type][1] += 1
    total += 1
    # Run model inference
    # pred = model.generate(video_path, question)
    # Check answer
    if check_ans(pred=pred, gt=gt):
        correct += 1
        acc_dict[task_type][0] += 1
# Calculate overall accuracy
print(f"Accuracy: {correct / total * 100:.2f}%")
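The per-task accuracy and random-baseline reporting mentioned in the description can be sketched as follows. This is a minimal illustration assuming `acc_dict` maps each task type to `[correct, total]` counts; the actual reporting code in the repository may differ.

```python
def report_accuracy(acc_dict: dict, num_options: int = 4) -> dict:
    """Sketch: compute per-task accuracy and a random-chance baseline.

    Assumes acc_dict maps task_type -> [correct, total]; with four options
    (A-D) the random baseline is 25%. Illustrative, not the repo's code.
    """
    results = {}
    for task, (correct, total) in acc_dict.items():
        results[task] = {
            "accuracy": 100.0 * correct / total,
            "random_baseline": 100.0 / num_options,  # 25.0 for 4 options
        }
    return results

print(report_accuracy({"count": [30, 100], "order": [55, 110]}))
```

Reporting the baseline alongside accuracy makes it easy to see which task types a model answers better than chance.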