Implementation: FlagOpen FlagEmbedding VideoLLaVA Choice Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Multiple Choice Evaluation, MLVU Benchmark |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for Video-LLaVA model on multiple-choice video understanding tasks from the MLVU benchmark.
Description
This implementation provides an evaluation pipeline for the Video-LLaVA-7B model on the MLVU benchmark's choice-based tasks. The script evaluates video understanding across seven task types: count, ego, needle, order, plotQA, anomaly recognition, and topic reasoning. Unlike the VideoChat2 implementation, this one uses the LanguageBind Video-LLaVA architecture with its native video processor and conversation templates. The model is loaded with 4-bit quantization for efficient inference and uses a prompt format that prepends one image token per video frame. The evaluation selects the best answer from the multiple choices based on the video content.
Usage
Use this script to evaluate Video-LLaVA models on MLVU benchmark's multiple-choice tasks. It is particularly suited for assessing long video understanding capabilities with the LLaVA conversation framework.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/models/videollava/choice_bench.py
- Lines: 1-261
Key Components
class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        # Dataset over the MLVU JSON annotations for Video-LLaVA evaluation
        ...

    def qa_template(self, data):
        # Render the question with lettered options and the ground-truth answer
        question = f"Question: {data['question']}\n"
        question += "Options:\n"
        answer = data['answer']
        for idx, c in enumerate(data['candidates']):
            question += f"({chr(ord('A') + idx)}) {c}\n"
            if c == data['answer']:
                answer = f"({chr(ord('A') + idx)}) {c}"
        return question, answer

def check_ans(pred, gt):
    # Extract the predicted answer option and compare it with the ground truth
    ...
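The check_ans helper above can be sketched as follows; the exact normalization steps are an assumption based on the option format "(A) text", not the script's literal code:

```python
def check_ans(pred: str, gt: str) -> bool:
    """Compare the option letter in a prediction against the ground truth."""
    def option_letter(s: str) -> str:
        # Strip a leading "(" and keep what precedes ")" or ".", e.g.
        # "(A) red" -> "A", "A) red" -> "A", "B." -> "B"
        return s.strip().lstrip('(').split(')')[0].split('.')[0].strip().upper()

    p, g = option_letter(pred), option_letter(gt)
    return bool(p) and p == g
```

Because the prompt is seeded with "Best Option: (", the model's continuation normally begins directly with the option letter, which this normalization handles.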
Imports
# Video-LLaVA components
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
# Standard libraries and PyTorch
import torch
import json
from tqdm import tqdm
import os
from torch.utils.data import Dataset  # base class for the MLVU dataset above
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing MLVU JSON annotation files |
| data_list | dict | Yes | Task type to (json_file, video_dir, data_type) mapping |
| model_path | str | No | Path to Video-LLaVA model (default: LanguageBind/Video-LLaVA-7B) |
| device | str | Yes | CUDA device identifier (e.g., 'cuda:6') |
| load_4bit | bool | No | Enable 4-bit quantization (default: True) |
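The data_list input could look like the sketch below; the JSON file names and video sub-directories are placeholders for illustration, not the script's actual paths:

```python
# Hypothetical task mapping: task name -> (json_file, video_dir, data_type).
# All file and directory names here are assumed, not taken from the script.
data_list = {
    "count":           ("count.json",           "video/count",           "video"),
    "ego":             ("ego.json",             "video/ego",             "video"),
    "needle":          ("needle.json",          "video/needle",          "video"),
    "order":           ("order.json",           "video/order",           "video"),
    "plotQA":          ("plotQA.json",          "video/plotQA",          "video"),
    "anomaly_reco":    ("anomaly_reco.json",    "video/anomaly_reco",    "video"),
    "topic_reasoning": ("topic_reasoning.json", "video/topic_reasoning", "video"),
}
```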
Outputs
| Name | Type | Description |
|---|---|---|
| test_all_choice.json | JSON file | Accuracy dictionary and detailed results list |
| bench_all.json | JSON file | Per-task accuracy and average accuracy |
| Console output | text | Part accuracy and progress information per task |
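The per-task and average accuracy reported in bench_all.json could be assembled as in this sketch; the function name and result schema are assumptions based on the Outputs table above:

```python
def summarize(correct: dict, total: dict) -> dict:
    """Per-task accuracy (in percent) plus the average across tasks."""
    acc = {task: 100.0 * correct[task] / total[task] for task in total}
    acc["Avg"] = sum(acc.values()) / len(total)  # unweighted mean over tasks
    return acc
```

A result such as `summarize({"count": 3, "ego": 1}, {"count": 4, "ego": 2})` yields per-task accuracies of 75.0 and 50.0 with an average of 62.5, which would then be dumped to JSON.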
Usage Examples
# Initialize model
disable_torch_init()
model_path = 'LanguageBind/Video-LLaVA-7B'
device = 'cuda:6'
load_4bit, load_8bit = True, False
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path),
    load_8bit, load_4bit, device=device
)
video_processor = processor['video']
# Setup conversation
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()
# Process video and generate prediction
video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
tensor = video_tensor.to(model.device, dtype=torch.float16)
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + question
conv.system = "Carefully watch this video and pay attention to every detail."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], "Best Option: (")
prompt = get_prompt2(conv)  # script-local helper that renders the conversation into a prompt string
# Add a batch dimension and move the token ids onto the model's device
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
# Stop generation at the conversation separator (uses the imported KeywordsStoppingCriteria)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=tensor,
        do_sample=True,
        temperature=0.1,
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria]
    )
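After generation, the newly produced tokens would be decoded (e.g. with `tokenizer.batch_decode` on `output_ids[:, input_ids.shape[1]:]`) and the stop token stripped before answer checking. A minimal string-cleanup helper, assuming the `</s>` stop token of the llava_v1 template:

```python
def clean_output(decoded: str, stop_str: str = "</s>") -> str:
    """Strip the stop token and surrounding whitespace from decoded text."""
    text = decoded.strip()
    if text.endswith(stop_str):
        text = text[: -len(stop_str)].strip()
    # Since the prompt ends with "Best Option: (", the cleaned text is
    # expected to begin with the chosen option letter.
    return text
```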