Principle:EvolvingLMMs Lab Lmms eval Task Utility Functions
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Task_Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Task-specific logic for extracting prompts, media, and computing results should be encapsulated in small, focused utility functions that conform to a standard interface.
Description
Every evaluation task must bridge the gap between the raw dataset rows and the inputs expected by the model, and between the model's raw outputs and the final metric scores. This bridging is accomplished through a set of utility functions, each responsible for one transformation step. By defining a standard interface for these functions, the framework can orchestrate the evaluation loop generically while delegating all task-specific logic to user-provided callables.
The key utility function interfaces are:
doc_to_visual(doc) -> list: Extracts visual media (images or video paths) from a dataset document. This function receives a single document dict and returns a list of PIL Image objects, video file paths, or other media references. Even for single-image tasks, the return type is always a list for consistency. For text-only tasks, this function may return None.
doc_to_text(doc) -> str: Constructs the text prompt from a dataset document. This may be as simple as returning a single column value, or it may involve complex template rendering that combines multiple fields, inserts task-specific instructions, and applies model-specific prompt formatting. When lmms_eval_specific_kwargs is configured, the function may accept a second argument containing model-specific prompt parameters (such as a pre-prompt or post-prompt).
doc_to_messages(doc) -> list[dict]: An alternative to doc_to_text + doc_to_visual for chat-based models. Returns a structured list of chat messages conforming to the ChatMessage protocol, where each message has a role (user, system, assistant) and content (a list of typed content items: text, image, video, audio).
process_results(doc, results) -> dict: Post-processes model outputs to produce metric-ready values. This function receives the original document and the model's output(s), and returns a dictionary mapping metric names to their values for this sample. This is where task-specific scoring logic lives (e.g., parsing yes/no answers, computing per-category scores, matching against ground truth).
These functions are typically defined in a utils.py file co-located with the task YAML and referenced via !function directives. This keeps task-specific logic separate from the evaluation framework while maintaining a clear contract.
Usage
Implement these utility functions whenever your task requires custom prompt construction, media extraction, or result processing that cannot be expressed as simple column references or Jinja2 templates. Place them in utils.py within your task directory and reference them in the YAML configuration with !function utils.function_name.
Theoretical Basis
The utility function pattern implements a Strategy design pattern where each function is an interchangeable strategy for one step of the evaluation pipeline:
Evaluation Pipeline:
doc --> doc_to_visual(doc) --> visuals: List[Image|Video]
doc --> doc_to_text(doc) --> prompt: str
OR
doc --> doc_to_messages(doc) --> messages: List[ChatMessage]
(visuals, prompt) --> Model --> raw_output: str
(doc, raw_output) --> process_results(doc, [raw_output]) --> {metric_name: value}
The function signatures establish a contract:
doc_to_visual: Dict -> List[Union[PIL.Image, str]]
doc_to_text: Dict -> str
doc_to_text: (Dict, Dict) -> str # with lmms_eval_specific_kwargs
doc_to_messages: Dict -> List[ChatMessage]
process_results: (Dict, List[str]) -> Dict[str, Any]
Where Dict represents a dataset document (a row from the HuggingFace dataset as a dictionary). The return type of process_results must have keys that match the metric names in the task's metric_list, ensuring the framework can route values to the correct aggregation functions.
This design achieves separation of concerns: the evaluation loop handles batching, caching, and orchestration, while utility functions handle all domain-specific interpretation.