Implementation:EvolvingLMMs Lab Lmms eval VideoMathQA Evaluation Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Vision, Evaluation, Video_Understanding, Mathematics
Last Updated	2026-02-14 00:00 GMT

Overview

Utility functions for evaluating video-language models on the VideoMathQA benchmark, which tests mathematical reasoning over video content across multiple question categories and video durations.

Description

This module provides the evaluation infrastructure for VideoMathQA, a mathematical reasoning benchmark built on educational videos. It supports multiple-choice (MCQ) and multi-binary question formats across 10 categories (Geometry Angle, Geometry Area, Geometry Length, Chart, Statistics, Arithmetic, Topology, Graph Theory, Counting, Puzzle) and three video durations (short, medium, long). The module handles video loading with uniform frame sampling, optional subtitle extraction with frame-aligned text, and answer extraction from model output using regex patterns. Aggregation reports per-category, per-duration, and overall accuracy; in multi-binary mode, a group of sub-questions earns credit only when every sub-question in it is answered correctly.
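Uniform frame sampling can be illustrated with a short, self-contained sketch. The helper name `uniform_frame_indices` is hypothetical and is not part of the module; it only shows one common way to pick evenly spaced frame indices, and it assumes that a non-positive `max_frames` means "keep every frame".

```python
from typing import List


def uniform_frame_indices(total_frames: int, max_frames: int) -> List[int]:
    """Return up to max_frames indices spread evenly over the video.

    Hypothetical helper for illustration; not the module's load_video.
    """
    if total_frames <= 0:
        return []
    # Assumption: a non-positive budget, or one larger than the video,
    # means every frame is kept.
    if max_frames <= 0 or max_frames >= total_frames:
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame exactly.
    last = total_frames - 1
    return [i * last // (max_frames - 1) for i in range(max_frames)]


print(uniform_frame_indices(total_frames=100, max_frames=5))
# [0, 24, 49, 74, 99]
```

The first and last frames are always included, which keeps the opening and closing of a worked solution in view even with a small frame budget.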

Usage

Use this module when evaluating video-language models on mathematical reasoning tasks with video context. It supports two evaluation modes: (1) MCQ, standard multiple-choice scored per question, and (2) multi-binary, grouped questions where every question in a group must be correct for the group to earn credit. Subtitle support is available via videomathqa_doc_to_text_subtitle for models that benefit from textual cues.
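The difference between the two scoring modes can be sketched as follows. This is a minimal illustration, not the module's aggregation code: the function names are hypothetical, and the grouping rule (the part of `question_id` before the last underscore) is an assumption chosen to match the ID style used in the examples on this page.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]


def per_question_accuracy(records: List[Record]) -> float:
    """MCQ-style scoring: every question counts on its own."""
    if not records:
        return 0.0
    correct = sum(r["pred_answer"] == r["answer"] for r in records)
    return 100.0 * correct / len(records)


def all_or_nothing_accuracy(
    records: List[Record], group_key: Callable[[Record], str]
) -> float:
    """Multi-binary-style scoring: a group counts only if all members are correct."""
    groups = defaultdict(list)
    for r in records:
        groups[group_key(r)].append(r["pred_answer"] == r["answer"])
    if not groups:
        return 0.0
    return 100.0 * sum(all(flags) for flags in groups.values()) / len(groups)


def id_prefix(record: Record) -> str:
    # Assumed grouping rule: "q001_2" belongs to group "q001".
    return record["question_id"].rsplit("_", 1)[0]


records = [
    {"question_id": "q001_1", "pred_answer": "A", "answer": "A"},
    {"question_id": "q001_2", "pred_answer": "B", "answer": "B"},
    {"question_id": "q002_1", "pred_answer": "A", "answer": "B"},
]
print(f"{per_question_accuracy(records):.1f}")               # 66.7
print(f"{all_or_nothing_accuracy(records, id_prefix):.1f}")  # 50.0
```

The same three predictions score 66.7 per question but only 50.0 per group, which is why the two modes should not be compared directly.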

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/videomathqa/utils.py

Signature

def videomathqa_doc_to_visual(doc: Dict) -> List[str]

def videomathqa_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str

def videomathqa_doc_to_text_subtitle(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str

def videomathqa_process_results(doc: Dict, results: List[str]) -> Dict[str, Dict]

def videomathqa_mcq_aggregate_results(results: List[Dict]) -> float
def videomathqa_multi_binary_aggregate_results(results: List[Dict]) -> float

def load_video(video_path: str, max_frames: int, annot_sample_rate: int = 1) -> Tuple[List[np.ndarray], List[int]]
def extract_subtitles(video_path: str, subtitle_path: str) -> Tuple[List[Tuple[int, int, str]], int]
def extract_characters_regex(s: str) -> str
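To make the regex-based answer extraction concrete, here is a hedged sketch of one way to pull an option letter out of free-form model output. The function `extract_choice_letter` and its patterns are assumptions for illustration; they are not the body of `extract_characters_regex`, whose exact rules should be read from the source file.

```python
import re


def extract_choice_letter(response: str, letters: str = "ABCDE") -> str:
    """Return the first plausible option letter in a response, or "" if none.

    Hypothetical helper; the patterns below are illustrative assumptions.
    """
    text = response.strip()
    # Prefer an explicit statement such as "the answer is B" or "Answer: (C)".
    explicit = re.search(
        rf"(?i:answer)\s*(?:is|:)?\s*\(?([{letters}])\b", text
    )
    if explicit:
        return explicit.group(1)
    # Otherwise fall back to the first standalone uppercase option letter.
    # Known limitation: a sentence starting with the article "A" also matches.
    standalone = re.search(rf"\b([{letters}])\b", text)
    return standalone.group(1) if standalone else ""


print(extract_choice_letter("The angle measure is 45 degrees, so the answer is B."))  # B
print(extract_choice_letter("Answer: (C)"))  # C
print(repr(extract_choice_letter("no idea")))  # ''
```

Returning an empty string for unparseable output lets the aggregation step count the question as wrong rather than crash.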

Import

from lmms_eval.tasks.videomathqa.utils import (
    videomathqa_doc_to_visual,
    videomathqa_doc_to_text,
    videomathqa_doc_to_text_subtitle,
    videomathqa_process_results,
    videomathqa_mcq_aggregate_results,
    videomathqa_multi_binary_aggregate_results
)

I/O Contract

videomathqa_doc_to_text Input

Field	Type	Description
doc["videoID"]	str	Video identifier
doc["question"]	str	Mathematical question text
doc["options"]	List[str]	Multiple choice options (e.g., ["A. 30", "B. 45", "C. 60"])
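The way these three fields combine into a prompt can be sketched as below. This is an approximation reconstructed from the example output shown under Usage Examples; `build_mcq_prompt` is a hypothetical name, and the exact instruction text and line breaks used by `videomathqa_doc_to_text` may differ.

```python
from typing import Dict


def build_mcq_prompt(doc: Dict, post_prompt: str = "") -> str:
    """Assemble instruction, question, options and an optional post-prompt.

    Hypothetical helper; wording copied from the example on this page.
    """
    lines = [
        "Select the best answer to the following multiple-choice question based on the video.",
        "Respond with the letter (A, B, C, D or E) of the correct option.",
        doc["question"],
        *doc["options"],  # one option per line, e.g. "A. 30"
    ]
    if post_prompt:
        lines.append(post_prompt)
    return "\n".join(lines)


doc = {
    "videoID": "math_video_001",  # not used in the prompt text itself
    "question": "What is the angle measure shown in the diagram?",
    "options": ["A. 30", "B. 45", "C. 60"],
}
print(build_mcq_prompt(doc, post_prompt="Answer:"))
```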

videomathqa_doc_to_text_subtitle Input

Additional Field	Type	Description
lmms_eval_specific_kwargs["frame_num"]	int	Number of frames to sample (-1 for all frames)
lmms_eval_specific_kwargs["gemini_api_flag"]	str	If "full subtitle", extract all subtitles
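Frame-aligned subtitles come down to two steps: converting subtitle timestamps to frame indices, then keeping only the segments that cover a sampled frame. The sketch below illustrates that idea under stated assumptions; both helper names are hypothetical, the SRT timestamp format `HH:MM:SS,mmm` is assumed, and the module's own `extract_subtitles` may behave differently.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start_frame, end_frame, text)


def srt_time_to_frame(timestamp: str, fps: float) -> int:
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to a frame index."""
    hms, millis = timestamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    total_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)
    return int(total_ms * fps / 1000)


def subtitles_for_frames(
    segments: Sequence[Segment], frame_indices: Sequence[int]
) -> List[str]:
    """Keep, in order and without duplicates, the text of every segment
    that covers at least one sampled frame."""
    selected: List[str] = []
    for start, end, text in segments:
        covered = any(start <= idx <= end for idx in frame_indices)
        if covered and text not in selected:
            selected.append(text)
    return selected


fps = 30.0  # assumed frame rate for this example
segments = [
    (srt_time_to_frame("00:00:00,000", fps), srt_time_to_frame("00:00:02,000", fps), "Consider a triangle."),
    (srt_time_to_frame("00:00:02,500", fps), srt_time_to_frame("00:00:04,000", fps), "Two of its angles are known."),
    (srt_time_to_frame("00:00:05,000", fps), srt_time_to_frame("00:00:07,000", fps), "Find the third angle."),
]
print(subtitles_for_frames(segments, frame_indices=[0, 100, 130]))
# ['Consider a triangle.', 'Two of its angles are known.']
```

With a small frame budget, some subtitle segments fall between sampled frames and are dropped, which is the trade-off a "full subtitle" style option avoids.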

videomathqa_process_results Input

Field	Type	Description
doc["question_id"]	str	Unique question identifier
doc["length"]	str	Video duration: "short", "medium", or "long"
doc["category"]	str	Math category (e.g., "Geometry Angle", "Counting")
doc["answer"]	str	Ground truth answer letter (A-E)
results	List[str]	Model prediction strings

videomathqa_process_results Output

Field	Type	Description
videomathqa_perception_score	Dict	Contains question_id, duration, category, pred_answer, answer

Categories and Durations

Dimension	Values
Category	Geometry Angle, Geometry Area, Geometry Length, Chart, Statistics, Arithmetic, Topology, Graph Theory, Counting, Puzzle
Duration	short, medium, long
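A per-category and per-duration breakdown can be computed from the records produced by the results step. The sketch below is illustrative only: `accuracy_breakdown` is a hypothetical helper, and it assumes each record carries the `category`, `duration`, `pred_answer`, and `answer` keys listed in the output contract above.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_breakdown(records: List[Dict[str, str]], field: str) -> Dict[str, float]:
    """Return accuracy in percent for each distinct value of `field`."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r[field]] += 1
        hits[r[field]] += int(r["pred_answer"] == r["answer"])
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


records = [
    {"duration": "short", "category": "Geometry Angle", "pred_answer": "B", "answer": "B"},
    {"duration": "short", "category": "Geometry Angle", "pred_answer": "A", "answer": "B"},
    {"duration": "medium", "category": "Counting", "pred_answer": "C", "answer": "C"},
]
print(accuracy_breakdown(records, "category"))
# {'Geometry Angle': 50.0, 'Counting': 100.0}
print(accuracy_breakdown(records, "duration"))
# {'short': 50.0, 'medium': 100.0}
```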

Usage Examples

from lmms_eval.tasks.videomathqa.utils import (
    videomathqa_doc_to_visual,
    videomathqa_doc_to_text,
    videomathqa_doc_to_text_subtitle,
    videomathqa_process_results,
    videomathqa_mcq_aggregate_results,
    videomathqa_multi_binary_aggregate_results,  # used in the multi-binary example below
    extract_subtitles
)

# Resolve the video path(s) for the model
doc = {
    "videoID": "math_video_001",
    "question": "What is the angle measure shown in the diagram?",
    "options": ["A. 30 degrees", "B. 45 degrees", "C. 60 degrees", "D. 90 degrees"],
    "answer": "B",
    "length": "short",
    "category": "Geometry Angle",
    "question_id": "q001"
}
video_paths = videomathqa_doc_to_visual(doc)
print(video_paths[0])  # e.g. /path/to/cache/videos/math_video_001.mp4 (depends on the local cache directory)

# Generate prompt without subtitles
prompt = videomathqa_doc_to_text(doc, {"post_prompt": "Answer:"})
print(prompt)
# "Select the best answer to the following multiple-choice question based on the video.
# Respond with the letter (A, B, C, D or E) of the correct option.
# What is the angle measure shown in the diagram?
# A. 30 degrees
# B. 45 degrees
# C. 60 degrees
# D. 90 degrees
# Answer:"

# Generate prompt with frame-aligned subtitles
subtitle_prompt = videomathqa_doc_to_text_subtitle(
    doc,
    {"frame_num": 8, "post_prompt": "Answer:"}
)
print(subtitle_prompt)  # prompt text; subtitle lines appear only if a subtitle file exists for this video

# Extract subtitles for specific frames
video_path = "/path/to/video.mp4"
subtitle_path = "/path/to/subtitle.srt"
subtitle_frames, total_frames = extract_subtitles(video_path, subtitle_path)
# subtitle_frames: [(start_frame, end_frame, text), ...]
print(f"Total frames: {total_frames}, Subtitle segments: {len(subtitle_frames)}")

# Process model results
model_output = ["The angle measure is 45 degrees, so the answer is B."]
result = videomathqa_process_results(doc, model_output)
print(result["videomathqa_perception_score"]["pred_answer"])  # "B"

# MCQ aggregation (per-question scoring)
all_results = [
    {"question_id": "q001", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "B", "answer": "B"},
    {"question_id": "q002", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "A", "answer": "B"},
    {"question_id": "q003", "duration": "medium", "category": "Counting",
     "pred_answer": "C", "answer": "C"}
]
mcq_score = videomathqa_mcq_aggregate_results(all_results)
print(f"MCQ Accuracy: {mcq_score:.1f}%")  # 66.7% (2/3 correct)
# Also prints per-category and per-duration breakdowns

# Multi-binary aggregation (grouped questions)
multi_binary_results = [
    {"question_id": "q001_1", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "A", "answer": "A"},
    {"question_id": "q001_2", "duration": "short", "category": "Geometry Angle",
     "pred_answer": "B", "answer": "B"},
    {"question_id": "q002_1", "duration": "short", "category": "Counting",
     "pred_answer": "A", "answer": "B"}  # Wrong - entire group fails
]
binary_score = videomathqa_multi_binary_aggregate_results(multi_binary_results)
print(f"Multi-Binary Accuracy: {binary_score:.1f}%")  # 50.0% (1/2 groups fully correct)
