Implementation:FlagOpen FlagEmbedding VideoLLaVA Choice Bench

From Leeroopedia


Knowledge Sources
Domains: Video Understanding, Multiple Choice Evaluation, MLVU Benchmark
Last Updated: 2026-02-09 00:00 GMT

Overview

An evaluation script for the Video-LLaVA model on multiple-choice video-understanding tasks from the MLVU benchmark.

Description

This implementation provides an evaluation pipeline for the Video-LLaVA-7B model on the MLVU benchmark's choice-based tasks. The script evaluates video understanding across seven task types: count, ego, needle, order, plotQA, anomaly recognition, and topic reasoning. Unlike the VideoChat2 implementation, this one uses the LanguageBind Video-LLaVA architecture with its native video processor and conversation templates. The model is loaded with 4-bit quantization for efficient inference, and the prompt format prepends one image token per video frame. The evaluation selects the best answer among the given choices based on the video content.

Usage

Use this script to evaluate Video-LLaVA models on MLVU benchmark's multiple-choice tasks. It is particularly suited for assessing long video understanding capabilities with the LLaVA conversation framework.

Code Reference

Source Location

Key Components

class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        # Dataset for Video-LLaVA evaluation
        ...

    def qa_template(self, data):
        # Build the lettered option list and the ground-truth answer string
        question = f"Question: {data['question']}\n"
        question += "Options:\n"
        answer_idx = -1
        for idx, c in enumerate(data['candidates']):
            question += f"({chr(ord('A') + idx)}) {c}\n"
            if c == data['answer']:
                answer_idx = idx
        answer = f"({chr(ord('A') + answer_idx)}) {data['answer']}"
        return question, answer

def check_ans(pred, gt):
    # Extract and compare the predicted answer option
    ...
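A minimal letter-matching version of `check_ans` could look like the sketch below. This is an illustrative implementation of the described behavior (compare the option letter in the prediction against the ground truth), not necessarily the exact matching rules used by the script:

```python
def check_ans(pred, gt):
    """Return True when the option letter in `pred` matches the one in `gt`.

    A sketch: takes the first whitespace-delimited token of each string
    (e.g. "(B)") and compares the two case-insensitively.
    """
    pred_option = pred.lower().strip().split(' ')[0].replace('.', '')
    gt_option = gt.lower().strip().split(' ')[0]
    if not pred_option:
        return False
    return pred_option in gt_option or gt_option in pred_option
```

Because the model is prompted to complete "Best Option: (", predictions usually begin with the option letter, so comparing only the first token is normally sufficient.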

Import

# Video-LLaVA components
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

# Standard libraries
import torch
import json
from tqdm import tqdm
import os

I/O Contract

Inputs

Name Type Required Description
data_dir str Yes Directory containing MLVU JSON annotation files
data_list dict Yes Task type to (json_file, video_dir, data_type) mapping
model_path str Yes Path to Video-LLaVA model (default: LanguageBind/Video-LLaVA-7B)
device str Yes CUDA device identifier (e.g., 'cuda:6')
load_4bit bool No Enable 4-bit quantization (default: True)
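The `data_list` input pairs each task type with its annotation file, video subdirectory, and media type. A sketch of this mapping is shown below; the file and directory names follow the common MLVU layout but should be checked against the local copy of the benchmark:

```python
data_dir = "MLVU"  # assumed local benchmark root

# Task name -> (annotation JSON, video directory, data type); names are illustrative
data_list = {
    "plotQA":          ("1_plotQA.json",          f"{data_dir}/video/plotQA",          "video"),
    "needle":          ("2_needle.json",          f"{data_dir}/video/needle",          "video"),
    "ego":             ("3_ego.json",             f"{data_dir}/video/ego",             "video"),
    "count":           ("4_count.json",           f"{data_dir}/video/count",           "video"),
    "order":           ("5_order.json",           f"{data_dir}/video/order",           "video"),
    "anomaly_reco":    ("6_anomaly_reco.json",    f"{data_dir}/video/anomaly_reco",    "video"),
    "topic_reasoning": ("7_topic_reasoning.json", f"{data_dir}/video/topic_reasoning", "video"),
}
```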

Outputs

Name Type Description
test_all_choice.json JSON file Accuracy dictionary and detailed results list
bench_all.json JSON file Per-task accuracy and average accuracy
Console output text Part accuracy and progress information per task
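The per-task and average accuracies written to `bench_all.json` can be produced with a straightforward tally. The sketch below uses placeholder counts; in the actual run, `correct` and `total` would be accumulated while iterating over each task's questions:

```python
import json

# Hypothetical per-task tallies accumulated during evaluation
correct = {"count": 18, "order": 27, "plotQA": 40}
total = {"count": 20, "order": 30, "plotQA": 50}

# Per-task accuracy (%) plus the unweighted average over tasks
res_list = {task: correct[task] / total[task] * 100 for task in total}
res_list["Avg"] = sum(res_list[task] for task in total) / len(total)

with open("bench_all.json", "w") as f:
    json.dump(res_list, f, indent=4)
```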

Usage Examples

# Initialize model
disable_torch_init()

model_path = 'LanguageBind/Video-LLaVA-7B'
device = 'cuda:6'
load_4bit, load_8bit = True, False

tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path),
    load_8bit, load_4bit, device=device
)
video_processor = processor['video']

# Setup conversation
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()

# Process video and generate prediction
video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
tensor = video_tensor.to(model.device, dtype=torch.float16)

inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + question
conv.system = "Carefully watch this video and pay attention to every detail."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], "Best Option: (")

# get_prompt2 is a helper defined in the evaluation script for serializing the conversation
prompt = get_prompt2(conv)
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
).unsqueeze(0).to(model.device)

# Stop generation at the conversation separator
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=tensor,
        do_sample=True,
        temperature=0.1,
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria]
    )

pred = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
