Overview
Utility functions for evaluating video-language models on the Video-MME benchmark, which tests video understanding across 6 domains, 30 sub-categories, 12 task types, and 3 video durations.
Description
This module provides evaluation infrastructure for Video-MME, a large-scale video understanding benchmark with hierarchical categorization. It supports multiple-choice questions spanning temporal, spatial, and attribute perception; action and object recognition; OCR; counting; and several reasoning tasks. The module handles video loading, optional subtitle extraction with frame-aligned text filtering, and regex-based answer extraction. It reports aggregated accuracy per video type (short, medium, long), domain (Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, Multilingual), sub-category (30 types), and task category (12 types).
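As a rough illustration of regex-based answer extraction, the sketch below pulls a standalone option letter out of a free-form model response. The helper name `extract_choice_letter`, the prefix list, and the word-boundary pattern are assumptions made for this example; they are not the module's `extract_characters_regex` implementation.

```python
import re

# Hypothetical prefixes to strip before searching; chosen for illustration only.
ANSWER_PREFIXES = ("The best answer is", "The correct answer is", "The answer is")


def extract_choice_letter(response: str) -> str:
    """Return the first standalone option letter (A-D), or "" if none is found."""
    text = response.strip()
    for prefix in ANSWER_PREFIXES:
        text = text.replace(prefix, "")
    # Word boundaries skip capital letters that merely start a word, e.g. "Based".
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else ""


print(extract_choice_letter("The best answer is B"))
print(extract_choice_letter("Based on the video, the answer is C."))
```

A pattern like this still misreads responses that open with the article "A", so a production extractor would likely need stricter rules.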
Usage
Use this module when evaluating video-language models on Video-MME. To give the model additional context, videomme_doc_to_text_subtitle adds subtitles to the prompt. Passing format="qwen3_vl" in lmms_eval_specific_kwargs selects the Qwen3-VL prompt format. For duration-specific evaluation, videomme_process_docs_long filters the dataset to long videos.
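Duration-specific filtering can be pictured with the minimal sketch below, which works on a plain list of dictionaries rather than a datasets.Dataset. The helper `filter_docs_by_duration` and the sample documents are hypothetical and only mirror the `duration` field described on this page.

```python
from typing import Dict, List


def filter_docs_by_duration(docs: List[Dict], duration: str) -> List[Dict]:
    """Keep only documents whose 'duration' field equals the requested value."""
    return [doc for doc in docs if doc.get("duration") == duration]


# Made-up documents for illustration.
docs = [
    {"question_id": "q001", "duration": "short"},
    {"question_id": "q002", "duration": "long"},
    {"question_id": "q003", "duration": "long"},
]

long_docs = filter_docs_by_duration(docs, "long")
print([doc["question_id"] for doc in long_docs])
```

With a datasets.Dataset, the same idea would presumably be expressed through `Dataset.filter` with a predicate on `duration`.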
Code Reference
Source Location
Signature
def videomme_doc_to_visual(doc: Dict) -> List[str]
def videomme_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def videomme_doc_to_text_subtitle(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def videomme_doc_to_text_qwen3vl(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def videomme_process_results(doc: Dict, results: List[str]) -> Dict[str, Dict]
def videomme_aggregate_results(results: List[Dict]) -> float
def videmme_process_docs_base(dataset: datasets.Dataset, type: str) -> datasets.Dataset
def videomme_process_docs_long(dataset: datasets.Dataset) -> datasets.Dataset
def extract_subtitles(video_path: str, subtitle_path: str) -> Tuple[List[Tuple[int, int, str]], int]
def extract_characters_regex(s: str) -> str
Import
from lmms_eval.tasks.videomme.utils import (
    videomme_doc_to_visual,
    videomme_doc_to_text,
    videomme_doc_to_text_subtitle,
    videomme_process_results,
    videomme_aggregate_results,
    videomme_process_docs_long,
)
I/O Contract
videomme_doc_to_text Input
| Field | Type | Description |
|---|---|---|
| doc["videoID"] | str | Video identifier |
| doc["question"] | str | Question text |
| doc["options"] | List[str] | Multiple-choice options (e.g., ["A. option1", "B. option2", ...]) |
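The sketch below shows one way these fields could be combined into a prompt. The function `build_prompt` is hypothetical; the instruction text and line layout are taken from the expected output in the usage example on this page, and the handling of `post_prompt` is an assumption.

```python
from typing import Dict, Optional

# Instruction text copied from the usage example on this page.
DEFAULT_INSTRUCTION = (
    "Select the best answer to the following multiple-choice question based on "
    "the video and the subtitles. Respond with only the letter (A, B, C, or D) "
    "of the correct option."
)


def build_prompt(doc: Dict, kwargs: Optional[Dict] = None) -> str:
    """Join instruction, question, options, and an optional post-prompt with newlines."""
    kwargs = kwargs or {}
    lines = [DEFAULT_INSTRUCTION, doc["question"], *doc["options"]]
    post_prompt = kwargs.get("post_prompt", "")
    if post_prompt:
        lines.append(post_prompt)
    return "\n".join(lines)


sample_doc = {
    "videoID": "video_001",
    "question": "What action is the person performing in the video?",
    "options": ["A. Running", "B. Walking", "C. Jumping", "D. Standing"],
}
sample_prompt = build_prompt(sample_doc, {"post_prompt": "Answer:"})
print(sample_prompt)
```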
videomme_doc_to_text_subtitle Additional Input
| Field | Type | Description |
|---|---|---|
| lmms_eval_specific_kwargs["frame_num"] | int | Number of frames to sample (-1 for all) |
| lmms_eval_specific_kwargs["gemini_api_flag"] | str | If "full subtitle", extract all subtitle text |
| lmms_eval_specific_kwargs["format"] | str | If "qwen3_vl", use Qwen3-VL prompt format |
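Frame-aligned subtitle filtering can be sketched as follows, assuming subtitles arrive as (start_frame, end_frame, text) tuples together with a total frame count, as the `extract_subtitles` signature suggests. The uniform sampling rule and both helper names are assumptions for illustration, not the module's actual logic.

```python
from typing import List, Tuple

Subtitle = Tuple[int, int, str]  # (start_frame, end_frame, text)


def sample_frame_indices(total_frames: int, frame_num: int) -> List[int]:
    """Uniformly sample frame indices; a negative frame_num means every frame."""
    if frame_num < 0 or frame_num >= total_frames:
        return list(range(total_frames))
    if frame_num == 0:
        return []
    step = total_frames / frame_num
    return [int(i * step) for i in range(frame_num)]


def filter_subtitles(subtitles: List[Subtitle], total_frames: int, frame_num: int) -> List[str]:
    """Keep subtitle texts whose frame span covers at least one sampled frame."""
    indices = sample_frame_indices(total_frames, frame_num)
    kept: List[str] = []
    for start, end, text in subtitles:
        covered = any(start <= idx <= end for idx in indices)
        if covered and text not in kept:  # drop repeated lines, keep order
            kept.append(text)
    return kept


# Made-up subtitle spans for a 100-frame video.
subs = [(0, 10, "Hello"), (30, 40, "skipped"), (45, 55, "World"), (70, 80, "Bye")]
print(filter_subtitles(subs, total_frames=100, frame_num=4))
```

Sampling 4 of 100 frames yields indices 0, 25, 50, and 75, so the subtitle spanning frames 30 to 40 is dropped in this example.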
videomme_process_results Input
| Field | Type | Description |
|---|---|---|
| doc["question_id"] | str | Unique question identifier |
| doc["duration"] | str | Video length: "short", "medium", or "long" |
| doc["domain"] | str | Main category (Knowledge, Film & Television, etc.) |
| doc["sub_category"] | str | Sub-category (30 types) |
| doc["task_type"] | str | Task category (12 types) |
| doc["answer"] | str | Ground truth answer letter (A-D) |
| doc["videoID"] | str | Video identifier for clustered stderr |
| results | List[str] | Model prediction strings |
videomme_process_results Output
| Field | Type | Description |
|---|---|---|
| videomme_perception_score | Dict | Contains question_id, duration, category, sub_category, task_category, pred_answer, answer, score, videoID |
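To make the mapping from input fields to output fields concrete, the sketch below builds a record with the keys listed above. The function `build_result_record` is hypothetical, and it assumes the predicted letter has already been extracted from the model output.

```python
from typing import Dict


def build_result_record(doc: Dict, pred_answer: str) -> Dict[str, Dict]:
    """Assemble a per-question record; 'domain' and 'task_type' are renamed on output."""
    score = 1.0 if pred_answer == doc["answer"] else 0.0
    record = {
        "question_id": doc["question_id"],
        "duration": doc["duration"],
        "category": doc["domain"],
        "sub_category": doc["sub_category"],
        "task_category": doc["task_type"],
        "pred_answer": pred_answer,
        "answer": doc["answer"],
        "score": score,
        "videoID": doc["videoID"],
    }
    return {"videomme_perception_score": record}


example_doc = {
    "question_id": "q001",
    "duration": "short",
    "domain": "Life Record",
    "sub_category": "Daily Life",
    "task_type": "Action Recognition",
    "answer": "B",
    "videoID": "video_001",
}
example_result = build_result_record(example_doc, "B")
print(example_result["videomme_perception_score"]["score"])
```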
Taxonomy
| Type | Values |
|---|---|
| VIDEO_TYPE | short, medium, long |
| CATEGORIES | Knowledge, Film & Television, Sports Competition, Artistic Performance, Life Record, Multilingual |
| SUB_CATEGORIES (30) | Humanity & History, Literature & Art, Biology & Medicine, Finance & Commerce, Astronomy, Geography, Law, Life Tip, Technology, Animation, Movie & TV Show, Documentary, News Report, Esports, Basketball, Football, Athletics, Other Sports, Stage Play, Magic Show, Variety Show, Acrobatics, Handicraft, Food, Fashion, Daily Life, Travel, Pet & Animal, Exercise, Multilingual |
| TASK_CATEGORIES (12) | Temporal Perception, Spatial Perception, Attribute Perception, Action Recognition, Object Recognition, OCR Problems, Counting Problem, Temporal Reasoning, Spatial Reasoning, Action Reasoning, Object Reasoning, Information Synopsis |
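A per-dimension accuracy breakdown over this taxonomy might be computed roughly as in the sketch below. The helpers `accuracy_by` and `overall_accuracy` are hypothetical, and reporting accuracy as a percentage is an assumption based on the usage example on this page.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by(results: List[Dict], key: str) -> Dict[str, float]:
    """Group records by one taxonomy field and return percent accuracy per group."""
    correct: Dict[str, float] = defaultdict(float)
    total: Dict[str, int] = defaultdict(int)
    for record in results:
        correct[record[key]] += record["score"]
        total[record[key]] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}


def overall_accuracy(results: List[Dict]) -> float:
    """Percent of records scored as correct; 0.0 for an empty list."""
    if not results:
        return 0.0
    return 100.0 * sum(record["score"] for record in results) / len(results)


# Made-up records for illustration.
records = [
    {"duration": "short", "task_category": "Action Recognition", "score": 1.0},
    {"duration": "short", "task_category": "Counting Problem", "score": 0.0},
    {"duration": "long", "task_category": "Temporal Reasoning", "score": 1.0},
]
print(accuracy_by(records, "duration"))
print(round(overall_accuracy(records), 1))
```

The same grouping function can be reused with `category`, `sub_category`, or `task_category` as the key.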
Usage Examples
from lmms_eval.tasks.videomme.utils import (
    videomme_doc_to_visual,
    videomme_doc_to_text,
    videomme_doc_to_text_subtitle,
    videomme_process_results,
    videomme_aggregate_results,
    videomme_process_docs_long,
)
import datasets
# Load video path
doc = {
    "videoID": "video_001",
    "question": "What action is the person performing in the video?",
    "options": ["A. Running", "B. Walking", "C. Jumping", "D. Standing"],
    "answer": "B",
    "duration": "short",
    "domain": "Life Record",
    "sub_category": "Daily Life",
    "task_type": "Action Recognition",
    "question_id": "q001",
}
video_paths = videomme_doc_to_visual(doc)
print(video_paths[0]) # /path/to/cache/data/video_001.mp4
# Standard prompt without subtitles
prompt = videomme_doc_to_text(doc, {"post_prompt": "Answer:"})
print(prompt)
# "Select the best answer to the following multiple-choice question based on
# the video and the subtitles. Respond with only the letter (A, B, C, or D)
# of the correct option.
# What action is the person performing in the video?
# A. Running
# B. Walking
# C. Jumping
# D. Standing
# Answer:"
# Prompt with frame-aligned subtitles
subtitle_prompt = videomme_doc_to_text_subtitle(doc, {"frame_num": 8})
print("This video's subtitles are listed below:" in subtitle_prompt) # True
# Qwen3-VL specific formatting
qwen_prompt = videomme_doc_to_text_subtitle(
    doc,
    {"format": "qwen3_vl", "pre_prompt": "Answer the question: ", "post_prompt": ""},
)
# Filter dataset for long videos only
full_dataset = datasets.load_dataset("...")
long_videos = videomme_process_docs_long(full_dataset)
print(f"Filtered to {len(long_videos)} long videos")
# Process model results
model_output = ["Based on the video, the person is walking. The answer is B."]
result = videomme_process_results(doc, model_output)
print(result["videomme_perception_score"]["pred_answer"]) # "B"
print(result["videomme_perception_score"]["score"]) # 1.0 (correct)
# Aggregate results with comprehensive breakdown
all_results = [
    {"question_id": "q001", "duration": "short", "category": "Life Record",
     "sub_category": "Daily Life", "task_category": "Action Recognition",
     "pred_answer": "B", "answer": "B", "score": 1.0, "videoID": "vid_001"},
    {"question_id": "q002", "duration": "medium", "category": "Knowledge",
     "sub_category": "Geography", "task_category": "Information Synopsis",
     "pred_answer": "A", "answer": "C", "score": 0.0, "videoID": "vid_002"},
    {"question_id": "q003", "duration": "long", "category": "Sports Competition",
     "sub_category": "Basketball", "task_category": "Temporal Reasoning",
     "pred_answer": "D", "answer": "D", "score": 1.0, "videoID": "vid_003"},
]
overall_acc = videomme_aggregate_results(all_results)
print(f"Overall Accuracy: {overall_acc:.1f}%") # 66.7%
# Also prints:
# - Per video type (short/medium/long)
# - Per domain (6 categories)
# - Per sub-category (30 types)
# - Per task category (12 types)
# Extract answer from various formats
from lmms_eval.tasks.videomme.utils import extract_characters_regex
assert extract_characters_regex("The best answer is B") == "B"
assert extract_characters_regex("B. Walking is correct") == "B"
assert extract_characters_regex("I think the answer should be C.") == "C"
Related Pages