Implementation:EvolvingLMMs Lab Lmms eval PAIBench U Utils

Task utility functions for the PAIBench-U (Perception and Understanding) benchmark, which evaluates video understanding through multiple-choice questions with hierarchical category analysis.

Location

/tmp/kapso_repo_sslb_59s/lmms_eval/tasks/paibench_u/utils.py

Overview

Provides video document processing, multiple-choice response parsing, and hierarchical accuracy aggregation (overall, category, subcategory) for PAIBench-U tasks.

Configuration

The module loads paibench_u.yaml at import time to determine the cache directory:

Reads dataset_kwargs.cache_dir from YAML
Constructs full cache path: $HF_HOME/{cache_dir}
Default HF_HOME: ~/.cache/huggingface

Global variables:

base_cache_dir: Expanded HF_HOME path
cache_dir_test: Full cache directory path
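
The path resolution above can be sketched as follows. This is a minimal sketch, assuming the YAML file has already been parsed into a dict; the helper name resolve_cache_dirs, the environ parameter, and the example cache_dir value are hypothetical and only serve the illustration.

```python
import os


def resolve_cache_dirs(config, environ=None):
    """Return (base_cache_dir, cache_dir_test) for a parsed task config.

    `config` stands in for the parsed contents of paibench_u.yaml.
    This helper is hypothetical and not part of the module.
    """
    environ = os.environ if environ is None else environ
    # Fall back to the default Hugging Face cache location when HF_HOME is unset.
    hf_home = environ.get("HF_HOME", "~/.cache/huggingface")
    base_cache_dir = os.path.expanduser(hf_home)
    cache_dir = config["dataset_kwargs"]["cache_dir"]
    cache_dir_test = os.path.join(base_cache_dir, cache_dir)
    return base_cache_dir, cache_dir_test


# Example with a made-up cache_dir value and an explicit HF_HOME.
example_config = {"dataset_kwargs": {"cache_dir": "paibench_u"}}
base, full = resolve_cache_dirs(example_config, environ={"HF_HOME": "/data/hf"})
```

Passing the environment as a parameter keeps the sketch testable; the module itself resolves these values once at import time.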

Core Functions

Document Processing

paibench_u_doc_to_visual(doc)

Retrieves video path from cache directory

Parameters: doc - Document with "video_path" key

Process:

Constructs path: {cache_dir_test}/videos/{doc["video_path"]}
Checks if video exists
Returns path list or empty list (with warning)

Returns: List with video path, or empty list if not found
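
The lookup described above could look roughly like the sketch below. It assumes the cache directory is passed in explicitly (the module uses the global cache_dir_test) and uses the standard logging module as a stand-in for loguru's eval_logger; the function name doc_to_visual_sketch is hypothetical.

```python
import logging
import os

logger = logging.getLogger(__name__)  # stand-in for loguru's eval_logger


def doc_to_visual_sketch(doc, cache_dir_test):
    """Return [video_path] if the file exists, otherwise [] with a warning."""
    video_path = os.path.join(cache_dir_test, "videos", doc["video_path"])
    if os.path.exists(video_path):
        return [video_path]
    # The warning wording here is illustrative, not the module's message.
    logger.warning("Video not found: %s", video_path)
    return []
```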

paibench_u_doc_to_text(doc, lmms_eval_specific_kwargs=None)

Constructs question with formatted options

Parameters:

doc - Document with question and index2ans (dict)
lmms_eval_specific_kwargs - Optional dict with pre_prompt and post_prompt

Process:

Extracts question
Sorts options by key
Filters non-null options
Formats as "A. option1", "B. option2", etc.
Joins with question
Adds pre/post prompts if provided

Returns: Full formatted prompt string
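
The prompt construction can be sketched as below. This assumes that the keys of index2ans are already the option letters and that the question and options are joined with newlines; neither detail is confirmed by this page, and the name doc_to_text_sketch is hypothetical.

```python
def doc_to_text_sketch(doc, lmms_eval_specific_kwargs=None):
    """Build a multiple-choice prompt from a question and its options."""
    kwargs = lmms_eval_specific_kwargs or {}
    question = doc["question"]
    # Sort options by key, then drop options whose value is None.
    options = [
        f"{key}. {value}"
        for key, value in sorted(doc["index2ans"].items())
        if value is not None
    ]
    # Assumption: question and options are separated by newlines.
    body = "\n".join([question] + options)
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    return f"{pre_prompt}{body}{post_prompt}"


example_doc = {
    "question": "What is shown?",
    "index2ans": {"B": "A dog", "A": "A cat", "C": None},
}
prompt = doc_to_text_sketch(example_doc, {"post_prompt": "\nAnswer:"})
```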

Response Parsing

parse_multi_choice_response(response)

Extracts single letter answer from model response

Parameters: response - Model's raw response string

Parsing Logic:

Strips whitespace
Removes common answer prefixes:
- "The best answer is", "The correct answer is", "The answer is", "The answer"
- "The best option is", "The correct option is"
- "Best answer:", "Best option:"
If the response has more than 10 words and contains no letter A-E: returns a random choice
Tries multiple regex patterns in order:
- \(([ABCDE])\) - Matches (A)
- \[([ABCDE])\] - Matches [A]
- ([ABCDE])\) - Matches A)
- ([ABCDE])\. - Matches A.
- ([ABCDE]) - Matches A
If no pattern matches: returns random choice

Returns: Single letter string "A" through "E"

Note: Patterns are tried in order, from most to least specific, so that a response such as "D. A book" is parsed as "D" rather than "A".
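
The parsing logic listed above can be sketched as follows. This is a minimal reconstruction from the steps on this page, not the module's source; the names ANSWER_PREFIXES, PATTERNS, and parse_multi_choice_sketch are hypothetical.

```python
import random
import re

ANSWER_PREFIXES = [
    "The best answer is",
    "The correct answer is",
    "The answer is",
    "The answer",
    "The best option is",
    "The correct option is",
    "Best answer:",
    "Best option:",
]

# Ordered from most to least specific; the first pattern that matches wins.
PATTERNS = [
    r"\(([ABCDE])\)",  # (A)
    r"\[([ABCDE])\]",  # [A]
    r"([ABCDE])\)",    # A)
    r"([ABCDE])\.",    # A.
    r"([ABCDE])",      # A
]


def parse_multi_choice_sketch(response):
    """Return a single letter A-E extracted from a raw model response."""
    s = response.strip()
    for prefix in ANSWER_PREFIXES:
        s = s.replace(prefix, "")
    # Long answers without any option letter are treated as unparseable.
    if len(s.split()) > 10 and not re.search(r"[ABCDE]", s):
        return random.choice("ABCDE")
    for pattern in PATTERNS:
        match = re.search(pattern, s)
        if match:
            return match.group(1)
    return random.choice("ABCDE")
```

For "D. A book", the first three patterns find nothing and the fourth matches "D.", so the bare-letter pattern never gets the chance to pick up the "A" in "A book".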

Result Processing

paibench_u_process_results(doc, results)

Processes model response into evaluation metrics

Parameters:

doc - Document with question, answer, category, subcategory
results - Model prediction list

Process:

Extracts prediction from results[0]
Parses prediction to single letter
Extracts category and subcategory
Constructs data dictionary

Returns: Dictionary with paibench_u_perception_score entry containing:

question_id: Question text
pred_answer: Parsed prediction (A-E)
answer: Ground truth (A-E)
category: Top-level category
subcategory: Fine-grained subcategory
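
A rough sketch of this step is shown below. To stay self-contained it uses a simplified stand-in parser (_first_letter, hypothetical) instead of parse_multi_choice_response, and it assumes the document keys are named "question", "answer", "category", and "subcategory".

```python
import re


def _first_letter(response):
    # Simplified stand-in for parse_multi_choice_response (no random fallback).
    match = re.search(r"[ABCDE]", response)
    return match.group(0) if match else ""


def process_results_sketch(doc, results):
    """Package one parsed prediction with its ground truth and categories."""
    pred_answer = _first_letter(results[0])
    data = {
        "question_id": doc["question"],  # the question text doubles as the ID
        "pred_answer": pred_answer,
        "answer": doc["answer"],
        "category": doc["category"],
        "subcategory": doc["subcategory"],
    }
    return {"paibench_u_perception_score": data}


example_doc = {
    "question": "Which way does the car turn?",
    "answer": "B",
    "category": "perception",
    "subcategory": "orientation:left-right",
}
entry = process_results_sketch(example_doc, ["(B)"])
```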

Aggregation

paibench_u_aggregate_results(results)

Computes hierarchical accuracy metrics

Parameters: results - List of result dictionaries

Aggregation Process:

Initializes counters:
- Overall: total_correct, total_answered
- Category-level: category_scores dict
- Subcategory-level: subcategory_scores dict (splits on ":")
For each result:
- Determines correctness: pred_answer == answer
- Updates overall counters
- Updates category counters
- Updates subcategory counters
Computes accuracies:
- Overall: 100 * correct / answered
- Per-category: 100 * category_correct / category_answered
- Per-subcategory: 100 * subcat_correct / subcat_answered
Logs all metrics with counts

Returns: Overall accuracy as percentage (0-100)

Logging Output:

Overall accuracy with counts
Category-level accuracy with counts
Subcategory-level accuracy with counts

Metrics Dictionary Structure:

{
  "overall": float,
  "category": {
    "category_name": float,
    ...
  },
  "subcategory": {
    "subcat_name": float,
    ...
  }
}
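
The aggregation steps and the metrics structure above can be sketched as follows. This is an illustrative sketch, not the module's code: the function names are hypothetical, the standard logging module stands in for eval_logger, the log wording is made up, and the zero-division guard is an added assumption.

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)  # stand-in for loguru's eval_logger


def _pct(correct, answered):
    # Assumption: report 0.0 for an empty bucket instead of dividing by zero.
    return 100 * correct / answered if answered else 0.0


def compute_metrics_sketch(results):
    """Return the overall / category / subcategory accuracy dictionary."""
    total = {"correct": 0, "answered": 0}
    by_category = defaultdict(lambda: {"correct": 0, "answered": 0})
    by_subcategory = defaultdict(lambda: {"correct": 0, "answered": 0})

    for r in results:
        correct = int(r["pred_answer"] == r["answer"])
        subcat = r["subcategory"].split(":")[0]  # group by the part before ":"
        for bucket in (total, by_category[r["category"]], by_subcategory[subcat]):
            bucket["correct"] += correct
            bucket["answered"] += 1

    metrics = {
        "overall": _pct(**total),
        "category": {k: _pct(**v) for k, v in by_category.items()},
        "subcategory": {k: _pct(**v) for k, v in by_subcategory.items()},
    }
    logger.info("Overall: %.2f (%d/%d)", metrics["overall"],
                total["correct"], total["answered"])
    for level, buckets in (("Category", by_category), ("Subcategory", by_subcategory)):
        for name, b in buckets.items():
            logger.info("%s %s: %.2f (%d/%d)", level, name,
                        _pct(**b), b["correct"], b["answered"])
    return metrics


def aggregate_sketch(results):
    """Return only the overall accuracy, as the aggregation function does."""
    return compute_metrics_sketch(results)["overall"]
```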

Design Notes

Random Choice Fallback

When parsing fails, parse_multi_choice_response returns a random choice in order to:

Match original LongVideoBench paper behavior (author: Haoning Wu)
Avoid systematic bias from always defaulting to same choice
Penalize unparseable responses without complete failure

Subcategory Processing

Subcategory strings may contain colons (e.g., "orientation:left-right")
Code splits on ":" and takes first part for grouping
Preserves hierarchical organization of categories
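
As a small illustration of the grouping rule, the sketch below shows how several fine-grained subcategory strings collapse into one group. The helper name subcategory_group and the sample values other than "orientation:left-right" are hypothetical.

```python
from collections import Counter


def subcategory_group(subcategory):
    # Keep only the text before the first colon; strings without a colon
    # are returned unchanged.
    return subcategory.split(":")[0]


samples = ["orientation:left-right", "orientation:up-down", "causal"]
group_counts = Counter(subcategory_group(s) for s in samples)
```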

Dependencies

os, pathlib.Path
yaml - Configuration file parsing
loguru.logger as eval_logger

Environment Variables

HF_HOME: Hugging Face cache directory (default: ~/.cache/huggingface)
