Implementation:EvolvingLMMs Lab Lmms eval PAIBench U Utils
Task utility functions for the PAIBench-U (Perception and Understanding) benchmark, which evaluates video understanding through multiple-choice questions with hierarchical category analysis.
Location
lmms_eval/tasks/paibench_u/utils.py
Overview
Provides video document processing, multiple-choice response parsing, and hierarchical accuracy aggregation (overall, category, subcategory) for PAIBench-U tasks.
Configuration
Module loads paibench_u.yaml at import to determine cache directory:
- Reads dataset_kwargs.cache_dir from YAML
- Constructs full cache path: $HF_HOME/{cache_dir}
- Default HF_HOME: ~/.cache/huggingface
Global variables:
- base_cache_dir: Expanded HF_HOME path
- cache_dir_test: Full cache directory path
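The path resolution described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name resolve_cache_dirs, its environ parameter, and the example cache_dir value are hypothetical, and the YAML parsing step is omitted, so the function takes an already-parsed config dict.

```python
import os


def resolve_cache_dirs(config, environ=None):
    """Return (base_cache_dir, cache_dir_test) for an already-parsed task config.

    Hypothetical helper: the real module is described as computing these
    values once at import time and storing them as globals.
    """
    environ = os.environ if environ is None else environ
    # Fall back to the default Hugging Face cache location when HF_HOME is unset.
    hf_home = environ.get("HF_HOME", "~/.cache/huggingface")
    base_cache_dir = os.path.expanduser(hf_home)
    # dataset_kwargs.cache_dir comes from the task YAML.
    cache_name = config["dataset_kwargs"]["cache_dir"]
    cache_dir_test = os.path.join(base_cache_dir, cache_name)
    return base_cache_dir, cache_dir_test


# Example with a made-up cache_dir value and an explicit environment mapping.
example_config = {"dataset_kwargs": {"cache_dir": "paibench_u"}}
base, full = resolve_cache_dirs(example_config, {"HF_HOME": "/data/hf"})
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching the real process environment.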
Core Functions
Document Processing
paibench_u_doc_to_visual(doc) - Retrieves video path from cache directory
- Parameters:
  - doc - Document with "video_path" key
- Process:
  - Constructs path: {cache_dir_test}/videos/{doc["video_path"]}
  - Checks if video exists
  - Returns path list or empty list (with warning)
- Returns: List with video path, or empty list if not found
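A minimal sketch of this lookup, assuming the cache directory is passed in explicitly rather than read from the cache_dir_test global. The function name doc_to_visual_sketch and the standard-library logger are stand-ins; the module itself is described as logging through eval_logger.

```python
import logging
import os

logger = logging.getLogger("paibench_u_sketch")  # stand-in for eval_logger


def doc_to_visual_sketch(doc, cache_dir):
    """Return [video_path] if the file exists under cache_dir/videos, else []."""
    video_path = os.path.join(cache_dir, "videos", doc["video_path"])
    if os.path.exists(video_path):
        return [video_path]
    # A missing video is reported but does not raise.
    logger.warning("Video not found: %s", video_path)
    return []
```

Returning an empty list instead of raising lets an evaluation run continue past a missing file, at the cost of that sample being evaluated without its video.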
paibench_u_doc_to_text(doc, lmms_eval_specific_kwargs=None) - Constructs question with formatted options
- Parameters:
  - doc - Document with question and index2ans (dict)
  - lmms_eval_specific_kwargs - Optional dict with pre_prompt and post_prompt
- Process:
  - Extracts question
  - Sorts options by key
  - Filters non-null options
  - Formats as "A. option1", "B. option2", etc.
  - Joins with question
  - Adds pre/post prompts if provided
- Returns: Full formatted prompt string
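The prompt construction can be sketched as below. Several details are assumptions, not confirmed by this page: that the keys of index2ans are the option letters themselves, that the question and options are joined with newlines, and that the pre/post prompts are concatenated without extra separators. The name doc_to_text_sketch is hypothetical.

```python
def doc_to_text_sketch(doc, lmms_eval_specific_kwargs=None):
    """Build a multiple-choice prompt from a question and its options."""
    kwargs = lmms_eval_specific_kwargs or {}
    question = doc["question"]
    # Sort options by key, drop null entries, and format as "A. text".
    options = [
        f"{key}. {text}"
        for key, text in sorted(doc["index2ans"].items())
        if text is not None
    ]
    # Assumption: one line per item; the real separator may differ.
    body = "\n".join([question] + options)
    return f"{kwargs.get('pre_prompt', '')}{body}{kwargs.get('post_prompt', '')}"


example_doc = {
    "question": "What is shown?",
    "index2ans": {"B": "A dog", "A": "A cat", "C": None},
}
prompt = doc_to_text_sketch(example_doc, {"post_prompt": "\nAnswer:"})
```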
Response Parsing
parse_multi_choice_response(response) - Extracts single letter answer from model response
- Parameters:
  - response - Model's raw response string
Parsing Logic:
- Strips whitespace
- Removes common answer prefixes:
  - "The best answer is", "The correct answer is", "The answer is", "The answer"
  - "The best option is", "The correct option is"
  - "Best answer:", "Best option:"
- If response > 10 words and no A-E found: returns random choice
- Tries multiple regex patterns in order:
  - \(([ABCDE])\) - Matches (A)
  - \[([ABCDE])\] - Matches [A]
  - ([ABCDE])\) - Matches A)
  - ([ABCDE])\. - Matches A.
  - ([ABCDE]) - Matches A
- If no pattern matches: returns random choice
- Returns: Single letter string "A" through "E"
- Note: Fixed to avoid parsing "D. A book" as "A" (uses ordered pattern matching)
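The parsing logic above can be approximated with the sketch below. It follows the listed steps, but the function name parse_choice_sketch, the choices parameter, and the use of str.replace for prefix removal are assumptions; the actual implementation may strip prefixes differently.

```python
import random
import re

# Prefixes are removed in this order, so "The answer is" goes before "The answer".
ANSWER_PREFIXES = [
    "The best answer is", "The correct answer is", "The answer is", "The answer",
    "The best option is", "The correct option is", "Best answer:", "Best option:",
]
# More specific patterns come first; the bare letter is the last resort.
PATTERNS = [
    r"\(([ABCDE])\)",  # (A)
    r"\[([ABCDE])\]",  # [A]
    r"([ABCDE])\)",    # A)
    r"([ABCDE])\.",    # A.
    r"([ABCDE])",      # A
]


def parse_choice_sketch(response, choices="ABCDE"):
    """Extract a single answer letter, falling back to a random choice."""
    s = response.strip()
    for prefix in ANSWER_PREFIXES:
        s = s.replace(prefix, "")
    # Long response containing no candidate letter: give up early.
    if len(s.split()) > 10 and not re.search("[ABCDE]", s):
        return random.choice(choices)
    for pattern in PATTERNS:
        match = re.search(pattern, s)
        if match:
            return match.group(1)
    return random.choice(choices)
```

With this ordering, "D. A book" resolves to "D": the `([ABCDE])\.` pattern is tried across the whole string before the bare-letter pattern is attempted at all.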
Result Processing
paibench_u_process_results(doc, results) - Processes model response into evaluation metrics
- Parameters:
  - doc - Document with question, answer, category, subcategory
  - results - Model prediction list
- Process:
  - Extracts prediction from results[0]
  - Parses prediction to single letter
  - Extracts category and subcategory
  - Constructs data dictionary
- Returns: Dictionary with paibench_u_perception_score entry containing:
  - question_id: Question text
  - pred_answer: Parsed prediction (A-E)
  - answer: Ground truth (A-E)
  - category: Top-level category
  - subcategory: Fine-grained subcategory
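As an illustration of the returned structure, the sketch below builds the same dictionary shape. The function name process_results_sketch and the parse parameter are hypothetical; the default parser here is a trivial stand-in (first character, upper-cased) for parse_multi_choice_response, and the example category values are made up.

```python
def process_results_sketch(doc, results, parse=None):
    """Package one prediction into the per-sample record used for aggregation."""
    # Stand-in parser: takes the first character of the response.
    parse = parse or (lambda text: text.strip()[:1].upper())
    pred_answer = parse(results[0])
    data = {
        "question_id": doc["question"],  # the question text doubles as the id
        "pred_answer": pred_answer,
        "answer": doc["answer"],
        "category": doc["category"],
        "subcategory": doc["subcategory"],
    }
    return {"paibench_u_perception_score": data}


example_doc = {
    "question": "Which way does the car turn?",
    "answer": "B",
    "category": "spatial",  # hypothetical category name
    "subcategory": "orientation:left-right",
}
record = process_results_sketch(example_doc, [" b) left"])
```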
Aggregation
paibench_u_aggregate_results(results) - Computes hierarchical accuracy metrics
- Parameters:
  - results - List of result dictionaries
Aggregation Process:
- Initializes counters:
  - Overall: total_correct, total_answered
  - Category-level: category_scores dict
  - Subcategory-level: subcategory_scores dict (splits on ":")
- For each result:
  - Determines correctness: pred_answer == answer
  - Updates overall counters
  - Updates category counters
  - Updates subcategory counters
- Computes accuracies:
  - Overall: 100 * correct / answered
  - Per-category: 100 * category_correct / category_answered
  - Per-subcategory: 100 * subcat_correct / subcat_answered
- Logs all metrics with counts
- Returns: Overall accuracy as percentage (0-100)
Logging Output:
- Overall accuracy with counts
- Category-level accuracy with counts
- Subcategory-level accuracy with counts
Metrics Dictionary Structure:
{
    "overall": float,
    "category": {
        "category_name": float,
        ...
    },
    "subcategory": {
        "subcat_name": float,
        ...
    }
}
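The aggregation steps and the metrics structure above can be sketched together as follows. This is an illustrative reimplementation under assumptions: the name aggregate_sketch and the counter layout are hypothetical, logging is omitted, and the sketch returns the whole metrics dictionary, whereas the function is described as returning only the overall accuracy.

```python
from collections import defaultdict


def aggregate_sketch(results):
    """Compute overall, per-category, and per-subcategory accuracy (0-100)."""
    total = {"correct": 0, "answered": 0}
    by_category = defaultdict(lambda: {"correct": 0, "answered": 0})
    by_subcategory = defaultdict(lambda: {"correct": 0, "answered": 0})

    for r in results:
        correct = int(r["pred_answer"] == r["answer"])
        # Group subcategories by the part before the first ":".
        subcat = r["subcategory"].split(":")[0]
        for bucket in (total, by_category[r["category"]], by_subcategory[subcat]):
            bucket["correct"] += correct
            bucket["answered"] += 1

    def pct(bucket):
        # Guard against an empty result list.
        return 100 * bucket["correct"] / bucket["answered"] if bucket["answered"] else 0.0

    return {
        "overall": pct(total),
        "category": {name: pct(b) for name, b in by_category.items()},
        "subcategory": {name: pct(b) for name, b in by_subcategory.items()},
    }


# Hypothetical records; category and subcategory names are made up.
sample = [
    {"pred_answer": "A", "answer": "A", "category": "spatial", "subcategory": "orientation:left-right"},
    {"pred_answer": "B", "answer": "C", "category": "spatial", "subcategory": "orientation:up-down"},
    {"pred_answer": "D", "answer": "D", "category": "temporal", "subcategory": "ordering"},
]
metrics = aggregate_sketch(sample)
```

In this sample both "orientation:..." records fall into a single "orientation" group, which is the effect of splitting on ":" and keeping the first part.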
Design Notes
Random Choice Fallback
When parsing fails, the function returns a random choice to:
- Match original LongVideoBench paper behavior (author: Haoning Wu)
- Avoid systematic bias from always defaulting to same choice
- Penalize unparseable responses without complete failure
Subcategory Processing
- Subcategory strings may contain colons (e.g., "orientation:left-right")
- Code splits on ":" and takes first part for grouping
- Preserves hierarchical organization of categories
Dependencies
- os, pathlib.Path
- yaml - Configuration file parsing
- loguru.logger as eval_logger
Environment Variables
- HF_HOME: Hugging Face cache directory (default: ~/.cache/huggingface)
Related
- Task_Utility_Functions - General task utility pattern
- Video_Task_Utils - Video processing utilities
- Multiple_Choice_Parsing - Response parsing patterns
- Hierarchical_Metrics - Nested metric aggregation