

Implementation:EvolvingLMMs Lab Lmms eval PAIBench U Utils

From Leeroopedia

Task utility functions for the PAIBench-U (Perception and Understanding) benchmark, which evaluates video understanding through multiple-choice questions with hierarchical category analysis.

Location

/tmp/kapso_repo_sslb_59s/lmms_eval/tasks/paibench_u/utils.py

Overview

Provides video document processing, multiple-choice response parsing, and hierarchical accuracy aggregation (overall, category, subcategory) for PAIBench-U tasks.

Configuration

The module loads paibench_u.yaml at import time to determine the cache directory:

  • Reads dataset_kwargs.cache_dir from YAML
  • Constructs full cache path: $HF_HOME/{cache_dir}
  • Default HF_HOME: ~/.cache/huggingface

Global variables:

  • base_cache_dir: Expanded HF_HOME path
  • cache_dir_test: Full cache directory path
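
The path construction above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper name resolve_cache_dir is hypothetical, the cache_dir value is made up, and the YAML content is passed in as an already-parsed dict so the sketch runs without a YAML library.

```python
import os


def resolve_cache_dir(config, environ=None):
    """Hypothetical helper: return (base_cache_dir, cache_dir_test)."""
    environ = os.environ if environ is None else environ
    # Fall back to the default Hugging Face cache location when HF_HOME is unset.
    hf_home = environ.get("HF_HOME", "~/.cache/huggingface")
    base_cache_dir = os.path.expanduser(hf_home)
    # dataset_kwargs.cache_dir names the sub-directory inside HF_HOME.
    cache_name = config["dataset_kwargs"]["cache_dir"]
    cache_dir_test = os.path.join(base_cache_dir, cache_name)
    return base_cache_dir, cache_dir_test


# Illustrative values only; the real cache_dir comes from paibench_u.yaml.
base, full = resolve_cache_dir(
    {"dataset_kwargs": {"cache_dir": "paibench_u"}},
    environ={"HF_HOME": "/data/hf"},
)
```

Passing the environment as a parameter keeps the sketch easy to test; the module itself computes these values once, as globals, at import.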

Core Functions

Document Processing

paibench_u_doc_to_visual(doc)
Retrieves video path from cache directory
Parameters: doc - Document with "video_path" key
Process:
  1. Constructs path: {cache_dir_test}/videos/{doc["video_path"]}
  2. Checks if video exists
  3. Returns path list or empty list (with warning)
Returns: List with video path, or empty list if not found
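
The lookup described above might look roughly like the sketch below. It assumes the cache directory is passed as an argument (the module uses the cache_dir_test global instead) and uses the standard logging module in place of eval_logger; the function name doc_to_visual_sketch is hypothetical.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def doc_to_visual_sketch(doc, cache_dir):
    """Return [video_path] if the file exists under cache_dir/videos, else []."""
    video_path = os.path.join(cache_dir, "videos", doc["video_path"])
    if os.path.exists(video_path):
        return [video_path]
    # A missing video is reported with a warning, not an exception.
    logger.warning("Video not found: %s", video_path)
    return []


# Demonstration against a throwaway directory with one empty placeholder file.
with tempfile.TemporaryDirectory() as cache_dir:
    os.makedirs(os.path.join(cache_dir, "videos"))
    clip = os.path.join(cache_dir, "videos", "clip.mp4")
    open(clip, "wb").close()
    found = doc_to_visual_sketch({"video_path": "clip.mp4"}, cache_dir)
    missing = doc_to_visual_sketch({"video_path": "absent.mp4"}, cache_dir)
    found_ok = found == [clip]
```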
paibench_u_doc_to_text(doc, lmms_eval_specific_kwargs=None)
Constructs question with formatted options
Parameters:
  • doc - Document with question and index2ans (dict)
  • lmms_eval_specific_kwargs - Optional dict with pre_prompt and post_prompt
Process:
  1. Extracts question
  2. Sorts options by key
  3. Filters non-null options
  4. Formats as "A. option1", "B. option2", etc.
  5. Joins with question
  6. Adds pre/post prompts if provided
Returns: Full formatted prompt string
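
As a rough sketch of these steps, the following assumes that the keys of index2ans are the option letters, that the question and options are joined with newlines, and that pre_prompt and post_prompt are concatenated directly. None of these details are confirmed by this page, and the name doc_to_text_sketch is hypothetical.

```python
def doc_to_text_sketch(doc, lmms_eval_specific_kwargs=None):
    """Build the prompt: question, lettered options, optional pre/post prompts."""
    kwargs = lmms_eval_specific_kwargs or {}
    question = doc["question"]
    # Sort options by key and drop entries whose value is None.
    options = [
        f"{key}. {value}"
        for key, value in sorted(doc["index2ans"].items())
        if value is not None
    ]
    prompt = "\n".join([question] + options)
    return f"{kwargs.get('pre_prompt', '')}{prompt}{kwargs.get('post_prompt', '')}"


example_doc = {
    "question": "What is shown?",
    "index2ans": {"B": "A dog", "A": "A cat", "C": None},
}
example_prompt = doc_to_text_sketch(
    example_doc,
    {"pre_prompt": "", "post_prompt": "\nAnswer with the option letter."},
)
```

In this example the null option C is dropped and the remaining options appear in key order, A before B.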

Response Parsing

parse_multi_choice_response(response)
Extracts single letter answer from model response
Parameters: response - Model's raw response string

Parsing Logic:

  1. Strips whitespace
  2. Removes common answer prefixes:
    • "The best answer is", "The correct answer is", "The answer is", "The answer"
    • "The best option is", "The correct option is"
    • "Best answer:", "Best option:"
  3. If the response has more than 10 words and contains none of the letters A-E: returns a random choice
  4. Tries multiple regex patterns in order:
    • \(([ABCDE])\) - Matches (A)
    • \[([ABCDE])\] - Matches [A]
    • ([ABCDE])\) - Matches A)
    • ([ABCDE])\. - Matches A.
    • ([ABCDE]) - Matches A
  5. If no pattern matches: returns random choice
Returns: Single letter string "A" through "E"
Note: Patterns are tried in order from most to least specific, so a response such as "D. A book" is parsed as "D" rather than "A".
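
The parsing logic can be approximated as below. The prefixes and patterns are taken from the lists above; how the module removes prefixes (here, plain str.replace) and counts words (here, whitespace splitting) are assumptions, and the name parse_multi_choice_sketch is hypothetical.

```python
import random
import re

# Longer prefixes come first so "The answer is" is removed before "The answer".
ANSWER_PREFIXES = [
    "The best answer is",
    "The correct answer is",
    "The answer is",
    "The answer",
    "The best option is",
    "The correct option is",
    "Best answer:",
    "Best option:",
]

# Tried in order, from most to least specific.
PATTERNS = [
    r"\(([ABCDE])\)",  # (A)
    r"\[([ABCDE])\]",  # [A]
    r"([ABCDE])\)",    # A)
    r"([ABCDE])\.",    # A.
    r"([ABCDE])",      # A
]

CHOICES = ["A", "B", "C", "D", "E"]


def parse_multi_choice_sketch(response):
    """Return a single letter A-E, falling back to a random choice."""
    s = response.strip()
    for prefix in ANSWER_PREFIXES:
        s = s.replace(prefix, "")
    # Long responses with no candidate letter are treated as unparseable.
    if len(s.split()) > 10 and not re.search("[ABCDE]", s):
        return random.choice(CHOICES)
    for pattern in PATTERNS:
        match = re.search(pattern, s)
        if match:
            return match.group(1)
    return random.choice(CHOICES)
```

With this ordering, "D. A book" matches the "A." style pattern at "D." before the bare-letter pattern gets a chance to pick up the "A" in "A book".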

Result Processing

paibench_u_process_results(doc, results)
Processes model response into evaluation metrics
Parameters:
  • doc - Document with question, answer, category, subcategory
  • results - Model prediction list
Process:
  1. Extracts prediction from results[0]
  2. Parses prediction to single letter
  3. Extracts category and subcategory
  4. Constructs data dictionary
Returns: Dictionary with paibench_u_perception_score entry containing:
  • question_id: Question text
  • pred_answer: Parsed prediction (A-E)
  • answer: Ground truth (A-E)
  • category: Top-level category
  • subcategory: Fine-grained subcategory
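
A minimal sketch of the returned structure follows. To stay self-contained it uses a trivial stand-in parser (first_letter) where the module calls parse_multi_choice_response; both function names in the sketch are hypothetical, and the category values in the example are illustrative.

```python
def first_letter(text):
    """Stand-in parser for this sketch only: first non-space character, uppercased."""
    return text.strip()[:1].upper()


def process_results_sketch(doc, results, parse=first_letter):
    """Wrap one parsed prediction and its metadata under the metric name."""
    pred = results[0]
    entry = {
        "question_id": doc["question"],
        "pred_answer": parse(pred),
        "answer": doc["answer"],
        "category": doc["category"],
        "subcategory": doc["subcategory"],
    }
    return {"paibench_u_perception_score": entry}


example_result = process_results_sketch(
    {
        "question": "What is shown?",
        "answer": "B",
        "category": "perception",
        "subcategory": "orientation:left-right",
    },
    ["B. A dog"],
)
```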

Aggregation

paibench_u_aggregate_results(results)
Computes hierarchical accuracy metrics
Parameters: results - List of result dictionaries

Aggregation Process:

  1. Initializes counters:
    • Overall: total_correct, total_answered
    • Category-level: category_scores dict
    • Subcategory-level: subcategory_scores dict (splits on ":")
  2. For each result:
    • Determines correctness: pred_answer == answer
    • Updates overall counters
    • Updates category counters
    • Updates subcategory counters
  3. Computes accuracies:
    • Overall: 100 * correct / answered
    • Per-category: 100 * category_correct / category_answered
    • Per-subcategory: 100 * subcat_correct / subcat_answered
  4. Logs all metrics with counts
Returns: Overall accuracy as percentage (0-100)

Logging Output:

  • Overall accuracy with counts
  • Category-level accuracy with counts
  • Subcategory-level accuracy with counts

Metrics Dictionary Structure:

{
  "overall": float,
  "category": {
    "category_name": float,
    ...
  },
  "subcategory": {
    "subcat_name": float,
    ...
  }
}
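
The aggregation can be sketched as below. Unlike the documented function, which returns only the overall accuracy and logs the rest, this hypothetical aggregate_sketch returns the whole metrics dictionary so the per-level values are easy to inspect. The category and subcategory names in the example are illustrative.

```python
from collections import defaultdict


def aggregate_sketch(results):
    """Compute overall, per-category, and per-subcategory accuracy (0-100)."""
    total = {"correct": 0, "answered": 0}
    category_scores = defaultdict(lambda: {"correct": 0, "answered": 0})
    subcategory_scores = defaultdict(lambda: {"correct": 0, "answered": 0})

    for r in results:
        correct = int(r["pred_answer"] == r["answer"])
        # Subcategories are grouped by the part before the first ":".
        subcat = r["subcategory"].split(":")[0]
        for bucket in (total, category_scores[r["category"]], subcategory_scores[subcat]):
            bucket["correct"] += correct
            bucket["answered"] += 1

    def pct(score):
        # Guard against division by zero on empty input.
        return 100 * score["correct"] / score["answered"] if score["answered"] else 0.0

    return {
        "overall": pct(total),
        "category": {name: pct(s) for name, s in category_scores.items()},
        "subcategory": {name: pct(s) for name, s in subcategory_scores.items()},
    }


example_metrics = aggregate_sketch([
    {"pred_answer": "A", "answer": "A", "category": "perception", "subcategory": "orientation:left-right"},
    {"pred_answer": "B", "answer": "A", "category": "perception", "subcategory": "orientation:up-down"},
    {"pred_answer": "C", "answer": "C", "category": "understanding", "subcategory": "causal"},
    {"pred_answer": "D", "answer": "D", "category": "understanding", "subcategory": "causal"},
])
```

Here three of four predictions are correct, so the overall accuracy is 75.0, and both "orientation:..." entries fall into a single "orientation" group.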

Design Notes

Random Choice Fallback

When parsing fails, parse_multi_choice_response returns a random choice to:

  • Match original LongVideoBench paper behavior (author: Haoning Wu)
  • Avoid systematic bias from always defaulting to same choice
  • Penalize unparseable responses without complete failure

Subcategory Processing

  • Subcategory strings may contain colons (e.g., "orientation:left-right")
  • Code splits on ":" and takes first part for grouping
  • Preserves hierarchical organization of categories

Dependencies

  • os, pathlib.Path
  • yaml - Configuration file parsing
  • loguru.logger as eval_logger

Environment Variables

  • HF_HOME: Hugging Face cache directory (default: ~/.cache/huggingface)
