Implementation:EvolvingLMMs Lab Lmms eval LongTimeScope Utils
Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/longtimescope/utils.py
Principle: Task_Utility_Functions
Purpose
Task-specific utilities for the LongTimeScope benchmark, which evaluates long-video understanding across QA, OCR, and temporal reasoning tasks.
Constants
TASK_CATEGORIES = ["QA", "OCR", "temporal"]
Configuration
- Reads cache directory from longtimescope.yaml
- Base cache from HF_HOME environment variable (default: ~/.cache/huggingface/)
- Handles video files with multiple extensions (mp4, MP4, mkv)
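The base-cache lookup described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper name `resolve_base_cache` and its `env` parameter are hypothetical, introduced only so the behaviour can be exercised without touching the real environment.

```python
import os


def resolve_base_cache(env=None):
    """Return the base cache directory with "~" expanded.

    Hypothetical helper: falls back to ~/.cache/huggingface/ when
    HF_HOME is not set, as the configuration notes describe.
    """
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME", "~/.cache/huggingface/")
    return os.path.expanduser(hf_home)
```

Under this assumption, the task's cache directory name read from longtimescope.yaml would then be joined onto the returned path.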
Key Functions
convert_time_to_frame
def convert_time_to_frame(time_in_seconds, fps)
Converts a time in seconds to a frame number at the given frame rate (FPS).
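A plausible body for this conversion is a single multiplication. The sketch below assumes the result is truncated to an integer; the source does not state the rounding rule, so treat that as an assumption.

```python
def convert_time_to_frame(time_in_seconds, fps):
    # Assumption: truncate toward zero, so the result is the index of
    # the frame on screen at the given timestamp.
    return int(time_in_seconds * fps)
```

For example, 2.5 seconds at 30 FPS corresponds to frame 75.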
timescope_doc_to_visual
def timescope_doc_to_visual(doc)
Locates video file path:
- Constructs path from cache directory and doc["video"]
- Tries multiple extensions (mp4 -> MP4 -> mkv)
- Exits with error if video not found
- Returns list with single video path
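The extension fallback can be sketched as below. This is an illustration under stated assumptions: `find_video` and its `cache_dir` parameter are hypothetical names, and `video_name` is assumed to be the value of doc["video"] without a file extension.

```python
import os
import sys


def find_video(cache_dir, video_name, extensions=("mp4", "MP4", "mkv")):
    """Return [path] for the first existing candidate, or exit.

    Hypothetical helper mirroring the mp4 -> MP4 -> mkv lookup order.
    """
    for ext in extensions:
        candidate = os.path.join(cache_dir, f"{video_name}.{ext}")
        if os.path.exists(candidate):
            # The task expects a list containing a single video path.
            return [candidate]
    # No candidate exists: stop the run with an error message.
    sys.exit(f"video not found: {os.path.join(cache_dir, video_name)}")
```

Passing a string to `sys.exit` prints it to stderr and exits with status 1, which matches the "exits with error" behaviour listed above.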
extract_characters_regex
def extract_characters_regex(s)
Extracts answer choice from response text:
- Strips common answer prefixes ("The best answer is", etc.)
- Returns an empty string if the response is longer than 10 words and contains no [ABCDEF] character
- Uses regex to find first [ABCDEF] character
- Returns matched character or empty string
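The steps above can be sketched as follows. The function name `extract_choice` is hypothetical, and only "The best answer is" comes from this page; the other entries in the prefix list are assumed examples of "common answer prefixes".

```python
import re

# Only the first prefix is taken from this page; the rest are assumed.
ANSWER_PREFIXES = [
    "The best answer is",
    "The correct answer is",
    "The answer is",
    "Best answer:",
]


def extract_choice(s):
    s = s.strip()
    for prefix in ANSWER_PREFIXES:
        s = s.replace(prefix, "")
    # Long free-form responses with no option letter are treated as unanswered.
    if len(s.split()) > 10 and not re.search(r"[ABCDEF]", s):
        return ""
    match = re.search(r"[ABCDEF]", s)
    return match[0] if match else ""
```

Note that in this sketch the first uppercase letter in the A-F range wins, so a response such as "Answer: D" would yield "A" because no listed prefix strips the leading word.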
timescope_process_results
def timescope_process_results(doc, results)
Processes single result:
- Extracts predicted answer using extract_characters_regex
- Creates data dict with id, length, video, task_type, pred_answer, pred, answer
- Returns dict with "timescope_perception_score" metric
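A rough sketch of the per-sample step is shown below. Several things here are assumptions rather than facts from the source: that the model response is `results[0]`, that the doc carries keys named `id`, `length`, `video`, `task_type`, and `answer`, and the simplified inline letter extraction that stands in for extract_characters_regex.

```python
import re


def process_result(doc, results):
    pred = results[0]  # assumed: raw model response is the first element
    # Simplified stand-in for extract_characters_regex.
    match = re.search(r"[ABCDEF]", pred)
    pred_answer = match[0] if match else ""
    data = {
        "id": doc["id"],
        "length": doc["length"],
        "video": doc["video"],
        "task_type": doc["task_type"],
        "pred_answer": pred_answer,
        "pred": pred,
        "answer": doc["answer"],
    }
    # The metric name maps to the per-sample record used later in aggregation.
    return {"timescope_perception_score": data}
```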
timescope_aggregate_results
def timescope_aggregate_results(results)
Aggregates results with detailed breakdown:
- Groups by video length and task type
- Computes accuracy for each length-task combination
- Logs per-length-task accuracy
- Logs per-length overall accuracy
- Returns overall accuracy across all videos
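The grouping and accuracy computation might look roughly like this. It is a sketch under assumptions: accuracy is reported as a percentage, Python's standard `logging` stands in for the project's logger, and the hypothetical `aggregate` also returns the two breakdown dicts for inspection, whereas the function described above returns only the overall accuracy.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("longtimescope_sketch")


def aggregate(results):
    pair_counts = defaultdict(lambda: [0, 0])    # (length, task) -> [correct, total]
    length_counts = defaultdict(lambda: [0, 0])  # length -> [correct, total]
    for r in results:
        # Case-insensitive comparison of predicted and reference letters.
        hit = int(r["pred_answer"].upper() == r["answer"].upper())
        for counts, key in (
            (pair_counts, (r["length"], r["task_type"])),
            (length_counts, r["length"]),
        ):
            counts[key][0] += hit
            counts[key][1] += 1

    per_pair = {k: 100.0 * c / t for k, (c, t) in pair_counts.items()}
    per_length = {k: 100.0 * c / t for k, (c, t) in length_counts.items()}
    for (length, task), acc in per_pair.items():
        logger.info("length=%s task=%s acc=%.1f%%", length, task, acc)
    for length, acc in per_length.items():
        logger.info("length=%s overall acc=%.1f%%", length, acc)

    correct = sum(c for c, _ in length_counts.values())
    total = sum(t for _, t in length_counts.values())
    overall = 100.0 * correct / total if total else 0.0
    return overall, per_pair, per_length
```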
Implementation Details
- Video length tracking for granular analysis
- Task type categorization (QA, OCR, temporal)
- Multiple-choice format with options A-F
- Case-insensitive answer comparison
- Comprehensive logging of category-specific performance