Implementation: FlagOpen FlagEmbedding VideoLLaVA Choice Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Multiple Choice Evaluation, MLVU Benchmark |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for Video-LLaVA model on multiple-choice video understanding tasks from the MLVU benchmark.
Description
This implementation provides an evaluation pipeline for the Video-LLaVA-7B model on the MLVU benchmark's choice-based tasks. The script evaluates video understanding across seven task types: count, ego, needle, order, plotQA, anomaly recognition, and topic reasoning. Unlike the VideoChat2 implementation, this one uses the LanguageBind Video-LLaVA architecture with its native video processor and conversation templates. The model is loaded with 4-bit quantization for efficient inference and uses a prompt format that prepends one image token per video frame. The evaluation selects the best answer from the multiple choices based on the video content.
Usage
Use this script to evaluate Video-LLaVA models on MLVU benchmark's multiple-choice tasks. It is particularly suited for assessing long video understanding capabilities with the LLaVA conversation framework.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/models/videollava/choice_bench.py
- Lines: 1-261
Key Components
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        # Dataset over the MLVU JSON annotations for Video-LLaVA evaluation
        ...

    def qa_template(self, data):
        # Render the question with lettered options and the ground-truth answer
        question = f"Question: {data['question']}\n"
        question += "Options:\n"
        answer = data['answer']
        for idx, c in enumerate(data['candidates']):
            question += f"({chr(ord('A') + idx)}) {c}\n"
            if c == data['answer']:
                answer = f"({chr(ord('A') + idx)}) {c}"
        return question, answer

def check_ans(pred, gt):
    # Extract the predicted answer option and compare it with the ground truth
    ...
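The check_ans helper above can be sketched as follows; the exact normalization steps are an assumption based on the option format "(A) text", not the script's literal code:

```python
def check_ans(pred: str, gt: str) -> bool:
    """Compare the option letter in a prediction against the ground truth."""
    def option_letter(s: str) -> str:
        # Strip a leading "(" and keep what precedes ")" or ".", e.g.
        # "(A) red" -> "A", "A) red" -> "A", "B." -> "B"
        return s.strip().lstrip('(').split(')')[0].split('.')[0].strip().upper()

    p, g = option_letter(pred), option_letter(gt)
    return bool(p) and p == g
```

Because the prompt is seeded with "Best Option: (", the model's continuation normally begins directly with the option letter, which this normalization handles.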
Imports
# Video-LLaVA components
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
# Standard libraries and PyTorch
import torch
import json
from tqdm import tqdm
import os
from torch.utils.data import Dataset  # base class for the MLVU dataset above
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing MLVU JSON annotation files |
| data_list | dict | Yes | Task type to (json_file, video_dir, data_type) mapping |
| model_path | str | No | Path to Video-LLaVA model (default: LanguageBind/Video-LLaVA-7B) |
| device | str | Yes | CUDA device identifier (e.g., 'cuda:6') |
| load_4bit | bool | No | Enable 4-bit quantization (default: True) |
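The data_list input could look like the sketch below; the JSON file names and video sub-directories are placeholders for illustration, not the script's actual paths:

```python
# Hypothetical task mapping: task name -> (json_file, video_dir, data_type).
# All file and directory names here are assumed, not taken from the script.
data_list = {
    "count":           ("count.json",           "video/count",           "video"),
    "ego":             ("ego.json",             "video/ego",             "video"),
    "needle":          ("needle.json",          "video/needle",          "video"),
    "order":           ("order.json",           "video/order",           "video"),
    "plotQA":          ("plotQA.json",          "video/plotQA",          "video"),
    "anomaly_reco":    ("anomaly_reco.json",    "video/anomaly_reco",    "video"),
    "topic_reasoning": ("topic_reasoning.json", "video/topic_reasoning", "video"),
}
```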
Outputs
| Name | Type | Description |
|---|---|---|
| test_all_choice.json | JSON file | Accuracy dictionary and detailed results list |
| bench_all.json | JSON file | Per-task accuracy and average accuracy |
| Console output | text | Part accuracy and progress information per task |
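The per-task and average accuracy reported in bench_all.json could be assembled as in this sketch; the function name and result schema are assumptions based on the Outputs table above:

```python
def summarize(correct: dict, total: dict) -> dict:
    """Per-task accuracy (in percent) plus the average across tasks."""
    acc = {task: 100.0 * correct[task] / total[task] for task in total}
    acc["Avg"] = sum(acc.values()) / len(total)  # unweighted mean over tasks
    return acc
```

A result such as `summarize({"count": 3, "ego": 1}, {"count": 4, "ego": 2})` yields per-task accuracies of 75.0 and 50.0 with an average of 62.5, which would then be dumped to JSON.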
Usage Examples
# Initialize model
disable_torch_init()
model_path = 'LanguageBind/Video-LLaVA-7B'
device = 'cuda:6'
load_4bit, load_8bit = True, False
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path),
    load_8bit, load_4bit, device=device
)
video_processor = processor['video']
# Setup conversation
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()
# Process video and generate prediction
video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
tensor = video_tensor.to(model.device, dtype=torch.float16)
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + question
conv.system = "Carefully watch this video and pay attention to every detail."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], "Best Option: (")
prompt = get_prompt2(conv)  # script-local helper that renders the conversation into a prompt string
# Add a batch dimension and move the token ids onto the model's device
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
# Stop generation at the conversation separator (uses the imported KeywordsStoppingCriteria)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=tensor,
        do_sample=True,
        temperature=0.1,
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria]
    )
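After generation, the newly produced tokens would be decoded (e.g. with `tokenizer.batch_decode` on `output_ids[:, input_ids.shape[1]:]`) and the stop token stripped before answer checking. A minimal string-cleanup helper, assuming the `</s>` stop token of the llava_v1 template:

```python
def clean_output(decoded: str, stop_str: str = "</s>") -> str:
    """Strip the stop token and surrounding whitespace from decoded text."""
    text = decoded.strip()
    if text.endswith(stop_str):
        text = text[: -len(stop_str)].strip()
    # Since the prompt ends with "Best Option: (", the cleaned text is
    # expected to begin with the chosen option letter.
    return text
```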