Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:EvolvingLMMs Lab Lmms eval Task Utility Functions

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Task_Management
Last Updated 2026-02-14 00:00 GMT

Overview

Task-specific logic for extracting prompts, media, and computing results should be encapsulated in small, focused utility functions that conform to a standard interface.

Description

Every evaluation task must bridge the gap between the raw dataset rows and the inputs expected by the model, and between the model's raw outputs and the final metric scores. This bridging is accomplished through a set of utility functions, each responsible for one transformation step. By defining a standard interface for these functions, the framework can orchestrate the evaluation loop generically while delegating all task-specific logic to user-provided callables.

The key utility function interfaces are:

doc_to_visual(doc) -> list: Extracts visual media (images or video paths) from a dataset document. This function receives a single document dict and returns a list of PIL Image objects, video file paths, or other media references. Even for single-image tasks, the return type is always a list for consistency. For text-only tasks, this function may return None.

doc_to_text(doc) -> str: Constructs the text prompt from a dataset document. This may be as simple as returning a single column value, or it may involve complex template rendering that combines multiple fields, inserts task-specific instructions, and applies model-specific prompt formatting. When lmms_eval_specific_kwargs is configured, the function may accept a second argument containing model-specific prompt parameters (such as a pre-prompt or post-prompt).

doc_to_messages(doc) -> list[dict]: An alternative to doc_to_text + doc_to_visual for chat-based models. Returns a structured list of chat messages conforming to the ChatMessage protocol, where each message has a role (user, system, assistant) and content (a list of typed content items: text, image, video, audio).

process_results(doc, results) -> dict: Post-processes model outputs to produce metric-ready values. This function receives the original document and the model's output(s), and returns a dictionary mapping metric names to their values for this sample. This is where task-specific scoring logic lives (e.g., parsing yes/no answers, computing per-category scores, matching against ground truth).

These functions are typically defined in a utils.py file co-located with the task YAML and referenced via !function directives. This keeps task-specific logic separate from the evaluation framework while maintaining a clear contract.

Usage

Implement these utility functions whenever your task requires custom prompt construction, media extraction, or result processing that cannot be expressed as simple column references or Jinja2 templates. Place them in utils.py within your task directory and reference them in the YAML configuration with !function utils.function_name.

Theoretical Basis

The utility function pattern implements a Strategy design pattern where each function is an interchangeable strategy for one step of the evaluation pipeline:

Evaluation Pipeline:

doc --> doc_to_visual(doc) --> visuals: List[Image|Video]
doc --> doc_to_text(doc) --> prompt: str
         OR
doc --> doc_to_messages(doc) --> messages: List[ChatMessage]

(visuals, prompt) --> Model --> raw_output: str

(doc, raw_output) --> process_results(doc, [raw_output]) --> {metric_name: value}

The function signatures establish a contract:

doc_to_visual:   Dict -> List[Union[PIL.Image, str]]
doc_to_text:     Dict -> str
doc_to_text:     (Dict, Dict) -> str        # with lmms_eval_specific_kwargs
doc_to_messages: Dict -> List[ChatMessage]
process_results: (Dict, List[str]) -> Dict[str, Any]

Where Dict represents a dataset document (a row from the HuggingFace dataset as a dictionary). The return type of process_results must have keys that match the metric names in the task's metric_list, ensuring the framework can route values to the correct aggregation functions.

This design achieves separation of concerns: the evaluation loop handles batching, caching, and orchestration, while utility functions handle all domain-specific interpretation.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment