Overview
Utility functions for evaluating video-language models on the VideoMathQA benchmark, which tests mathematical reasoning over video content across multiple categories and video durations.
Description
This module provides evaluation infrastructure for VideoMathQA, a mathematical reasoning benchmark built on educational videos. It supports multiple-choice (MCQ) and multi-binary question formats across 10 categories (Geometry Angle/Area/Length, Chart, Statistics, Arithmetic, Topology, Graph Theory, Counting, Puzzle) and three video durations (short, medium, long). The module handles video loading with uniform frame sampling, optional subtitle extraction with frame-aligned text, and regex-based answer extraction. Aggregation reports per-category, per-duration, and overall accuracy; a multi-binary question earns credit only when all of its sub-questions are answered correctly.
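Uniform frame sampling can be illustrated with a short sketch. This is not the module's `load_video` code: `uniform_frame_indices` is a hypothetical helper, and treating a non-positive `max_frames` as "keep every frame" is an assumption made here for illustration.

```python
from typing import List


def uniform_frame_indices(total_frames: int, max_frames: int) -> List[int]:
    """Pick up to max_frames evenly spaced frame indices (hypothetical helper)."""
    if total_frames <= 0:
        return []
    # Assumption: a non-positive budget, or a budget that covers the whole
    # video, means every frame is kept.
    if max_frames <= 0 or max_frames >= total_frames:
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Spread the samples from the first frame to the last frame, inclusive.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


print(uniform_frame_indices(10, 4))  # [0, 3, 6, 9]
```

Because the step is always greater than one when the budget is smaller than the video, the rounded indices never repeat.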
Usage
Use this module when evaluating video-language models on mathematical reasoning tasks with video context. It supports two evaluation modes: (1) MCQ, standard multiple-choice scoring with per-question accuracy, and (2) multi-binary, where questions are grouped and every question in a group must be correct for the group to earn credit. Subtitle support is available via videomathqa_doc_to_text_subtitle for models that benefit from textual cues.
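The all-or-nothing rule of the multi-binary mode can be sketched as follows. This is a hedged illustration, not the module's aggregation code: `multi_binary_accuracy` is a hypothetical helper, and deriving the group from `question_id` by stripping a trailing `_<n>` suffix (as in IDs like `q001_1`) is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def multi_binary_accuracy(results: List[Dict[str, str]]) -> float:
    """Percentage of groups in which every sub-question is correct."""
    groups = defaultdict(list)
    for r in results:
        # Assumption: "q001_2" belongs to group "q001".
        group_id = r["question_id"].rsplit("_", 1)[0]
        groups[group_id].append(r["pred_answer"] == r["answer"])
    if not groups:
        return 0.0
    fully_correct = sum(all(flags) for flags in groups.values())
    return 100.0 * fully_correct / len(groups)


sample = [
    {"question_id": "q001_1", "pred_answer": "A", "answer": "A"},
    {"question_id": "q001_2", "pred_answer": "B", "answer": "B"},
    {"question_id": "q002_1", "pred_answer": "A", "answer": "B"},
]
print(multi_binary_accuracy(sample))  # 50.0
```

Under per-question MCQ scoring the same three records would score 2 of 3; grouping makes the metric stricter because one wrong sub-question fails its whole group.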
Code Reference
Source Location
Signature
def videomathqa_doc_to_visual(doc: Dict) -> List[str]
def videomathqa_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def videomathqa_doc_to_text_subtitle(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def videomathqa_process_results(doc: Dict, results: List[str]) -> Dict[str, Dict]
def videomathqa_mcq_aggregate_results(results: List[Dict]) -> float
def videomathqa_multi_binary_aggregate_results(results: List[Dict]) -> float
def load_video(video_path: str, max_frames: int, annot_sample_rate: int = 1) -> Tuple[List[np.ndarray], List[int]]
def extract_subtitles(video_path: str, subtitle_path: str) -> Tuple[List[Tuple[int, int, str]], int]
def extract_characters_regex(s: str) -> str
Import
from lmms_eval.tasks.videomathqa.utils import (
    videomathqa_doc_to_visual,
    videomathqa_doc_to_text,
    videomathqa_doc_to_text_subtitle,
    videomathqa_process_results,
    videomathqa_mcq_aggregate_results,
    videomathqa_multi_binary_aggregate_results,
)
I/O Contract
videomathqa_doc_to_text Input
| Field | Type | Description |
|---|---|---|
| doc["videoID"] | str | Video identifier |
| doc["question"] | str | Mathematical question text |
| doc["options"] | List[str] | Multiple choice options (e.g., ["A. 30", "B. 45", "C. 60"]) |
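The fields above map onto the prompt roughly as follows. This is a minimal sketch rather than the module's implementation: `build_mcq_prompt` is a hypothetical helper, and the instruction wording is copied from the example output shown under Usage Examples.

```python
from typing import Dict

# Instruction text as it appears in the example prompt under Usage Examples.
INSTRUCTION = (
    "Select the best answer to the following multiple-choice question based on the video.\n"
    "Respond with the letter (A, B, C, D or E) of the correct option."
)


def build_mcq_prompt(doc: Dict, post_prompt: str = "") -> str:
    """Join instruction, question, options, and optional post-prompt (hypothetical helper)."""
    parts = [INSTRUCTION, doc["question"], *doc["options"]]
    if post_prompt:
        parts.append(post_prompt)
    return "\n".join(parts)


example_doc = {
    "question": "What is the angle measure shown in the diagram?",
    "options": ["A. 30 degrees", "B. 45 degrees", "C. 60 degrees"],
}
print(build_mcq_prompt(example_doc, post_prompt="Answer:"))
```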
videomathqa_doc_to_text_subtitle Input
| Additional Field | Type | Description |
|---|---|---|
| lmms_eval_specific_kwargs["frame_num"] | int | Number of frames to sample (-1 for all frames) |
| lmms_eval_specific_kwargs["gemini_api_flag"] | str | If "full subtitle", extract all subtitles |
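Frame-aligned subtitles can be pictured as two steps: convert subtitle timestamps to frame indices, then keep only the segments that overlap a sampled frame. The sketch below assumes SRT-style `HH:MM:SS,mmm` timestamps and the `(start_frame, end_frame, text)` segment layout returned by `extract_subtitles`; `srt_time_to_frame` and `subtitles_for_frames` are hypothetical helpers, not functions of this module.

```python
from typing import List, Sequence, Tuple


def srt_time_to_frame(timestamp: str, fps: float) -> int:
    """Convert an SRT timestamp such as '00:00:01,500' to a frame index."""
    hms, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds + int(millis) / 1000
    return int(total_seconds * fps)


def subtitles_for_frames(
    subtitle_frames: Sequence[Tuple[int, int, str]],
    frame_indices: Sequence[int],
) -> List[str]:
    """Keep the text of segments that cover at least one sampled frame."""
    kept: List[str] = []
    for start, end, text in subtitle_frames:
        if any(start <= idx <= end for idx in frame_indices):
            # Skip consecutive duplicates so repeated captions appear once.
            if not kept or kept[-1] != text:
                kept.append(text)
    return kept


segments = [
    (0, 30, "Consider a triangle."),
    (31, 90, "Angle A is 45 degrees."),
    (91, 120, "So angle B is 45 degrees too."),
]
print(srt_time_to_frame("00:00:01,500", fps=30))  # 45
print(subtitles_for_frames(segments, [0, 60]))
```

With only frames 0 and 60 sampled, the third segment is dropped because no sampled frame falls inside it.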
videomathqa_process_results Input
| Field | Type | Description |
|---|---|---|
| doc["question_id"] | str | Unique question identifier |
| doc["length"] | str | Video duration: "short", "medium", or "long" |
| doc["category"] | str | Math category (e.g., "Geometry Angle", "Counting") |
| doc["answer"] | str | Ground truth answer letter (A-E) |
| results | List[str] | Model prediction strings |
videomathqa_process_results Output
| Field | Type | Description |
|---|---|---|
| videomathqa_perception_score | Dict | Contains question_id, duration, category, pred_answer, answer |
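The `pred_answer` field is obtained by pulling an option letter out of free-form model text. The sketch below shows one way regex-based extraction could work; it is not the source of `extract_characters_regex`, and both patterns as well as the `extract_choice_letter` name are assumptions made for illustration.

```python
import re

# "answer is B", "Answer: (C)": the keyword is case-insensitive, the letter is not.
_EXPLICIT = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-E])\b")
# Fallback: any standalone capital letter A-E.
_STANDALONE = re.compile(r"\b([A-E])\b")


def extract_choice_letter(response: str) -> str:
    """Return the predicted option letter, or '' if none is found."""
    text = response.strip()
    match = _EXPLICIT.search(text) or _STANDALONE.search(text)
    return match.group(1) if match else ""


print(extract_choice_letter("The angle measure is 45 degrees, so the answer is B."))  # B
print(extract_choice_letter("(C) 60 degrees"))  # C
```

The fallback pattern is deliberately loose, so a response that opens with the article "A" would be read as option A; a stricter extractor would need additional prefix handling.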
Categories and Durations
| Category List | Duration List |
|---|---|
| Geometry Angle, Geometry Area, Geometry Length, Chart, Statistics, Arithmetic, Topology, Graph Theory, Counting, Puzzle | short, medium, long |
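Per-category and per-duration accuracy amount to bucketing result records by one key and computing accuracy inside each bucket. The following is a minimal sketch over records shaped like the `videomathqa_perception_score` dictionary; `accuracy_breakdown` is a hypothetical helper, not part of the module.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_breakdown(results: List[Dict[str, str]], key: str) -> Dict[str, float]:
    """Percentage accuracy per distinct value of `key` (e.g. 'category' or 'duration')."""
    counts = defaultdict(lambda: [0, 0])  # value -> [correct, total]
    for r in results:
        bucket = counts[r[key]]
        bucket[0] += int(r["pred_answer"] == r["answer"])
        bucket[1] += 1
    return {name: 100.0 * correct / total for name, (correct, total) in counts.items()}


records = [
    {"duration": "short", "category": "Geometry Angle", "pred_answer": "B", "answer": "B"},
    {"duration": "short", "category": "Geometry Angle", "pred_answer": "A", "answer": "B"},
    {"duration": "medium", "category": "Counting", "pred_answer": "C", "answer": "C"},
]
print(accuracy_breakdown(records, "category"))  # {'Geometry Angle': 50.0, 'Counting': 100.0}
print(accuracy_breakdown(records, "duration"))  # {'short': 50.0, 'medium': 100.0}
```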
Usage Examples
from lmms_eval.tasks.videomathqa.utils import (
    videomathqa_doc_to_visual,
    videomathqa_doc_to_text,
    videomathqa_doc_to_text_subtitle,
    videomathqa_process_results,
    videomathqa_mcq_aggregate_results,
    videomathqa_multi_binary_aggregate_results,
    extract_subtitles,
)
# Load video for model
doc = {
    "videoID": "math_video_001",
    "question": "What is the angle measure shown in the diagram?",
    "options": ["A. 30 degrees", "B. 45 degrees", "C. 60 degrees", "D. 90 degrees"],
    "answer": "B",
    "length": "short",
    "category": "Geometry Angle",
    "question_id": "q001",
}
video_paths = videomathqa_doc_to_visual(doc)
print(video_paths[0]) # /path/to/cache/videos/math_video_001.mp4
# Generate prompt without subtitles
prompt = videomathqa_doc_to_text(doc, {"post_prompt": "Answer:"})
print(prompt)
# "Select the best answer to the following multiple-choice question based on the video.
# Respond with the letter (A, B, C, D or E) of the correct option.
# What is the angle measure shown in the diagram?
# A. 30 degrees
# B. 45 degrees
# C. 60 degrees
# D. 90 degrees
# Answer:"
# Generate prompt with frame-aligned subtitles
subtitle_prompt = videomathqa_doc_to_text_subtitle(
    doc,
    {"frame_num": 8, "post_prompt": "Answer:"},
)
print("Subtitle included" in subtitle_prompt) # True
# Extract subtitles for specific frames
video_path = "/path/to/video.mp4"
subtitle_path = "/path/to/subtitle.srt"
subtitle_frames, total_frames = extract_subtitles(video_path, subtitle_path)
# subtitle_frames: [(start_frame, end_frame, text), ...]
print(f"Total frames: {total_frames}, Subtitle segments: {len(subtitle_frames)}")
# Process model results
model_output = ["The angle measure is 45 degrees, so the answer is B."]
result = videomathqa_process_results(doc, model_output)
print(result["videomathqa_perception_score"]["pred_answer"]) # "B"
# MCQ aggregation (per-question scoring)
all_results = [
    {"question_id": "q001", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "B", "answer": "B"},
    {"question_id": "q002", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "A", "answer": "B"},
    {"question_id": "q003", "duration": "medium", "category": "Counting",
     "pred_answer": "C", "answer": "C"},
]
mcq_score = videomathqa_mcq_aggregate_results(all_results)
print(f"MCQ Accuracy: {mcq_score:.1f}%") # 66.7% (2/3 correct)
# Also prints per-category and per-duration breakdowns
# Multi-binary aggregation (grouped questions)
multi_binary_results = [
    {"question_id": "q001_1", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "A", "answer": "A"},
    {"question_id": "q001_2", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "B", "answer": "B"},
    {"question_id": "q002_1", "duration": "short", "category": "Counting",
     "pred_answer": "A", "answer": "B"},  # Wrong - entire group fails
]
binary_score = videomathqa_multi_binary_aggregate_results(multi_binary_results)
print(f"Multi-Binary Accuracy: {binary_score:.1f}%") # 50.0% (1/2 groups fully correct)
Related Pages