Implementation:EvolvingLMMs Lab Lmms eval Task Utility Interface
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Task_Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete interface, provided by the lmms-eval framework, for implementing task-specific data extraction and result-processing functions.
Description
The lmms-eval framework defines a set of method interfaces on ConfigurableTask that dispatch to user-provided utility functions. These methods handle three types of resolution for their corresponding config values:
- String resolution: If the config value is a string that matches a dataset column name, the method returns doc[column_name] directly. Otherwise, the string is treated as a Jinja2-style template and rendered with the document as context via utils.apply_template().
- Callable resolution: If the config value was loaded from a !function YAML directive, it is a Python callable. The method invokes it with the document dict and, if configured, lmms_eval_specific_kwargs as a second argument.
- Integer resolution: For doc_to_text and doc_to_target, an integer value is returned as-is, typically as an index into choices.
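The resolution rules above can be sketched as one standalone function. This is a simplified illustration, not the code in ConfigurableTask. The names resolve_config_value and render_template are hypothetical. The Jinja2 rendering step is replaced by a regex that only substitutes plain {{ column }} placeholders, so filters, loops, and conditionals are not supported.

```python
import re
from typing import Any, Callable, Collection, Optional, Union

ConfigValue = Union[int, str, Callable[..., Any]]


def render_template(template: str, doc: dict) -> str:
    """Hypothetical stand-in for Jinja2: only handles {{ column }} placeholders."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(doc[m.group(1)]), template)


def resolve_config_value(
    value: ConfigValue,
    doc: dict,
    features: Collection[str],
    lmms_eval_specific_kwargs: Optional[dict] = None,
) -> Any:
    # Integer resolution: returned unchanged (e.g. an index into choices).
    if isinstance(value, int):
        return value
    # String resolution: a matching column name wins; anything else is a template.
    if isinstance(value, str):
        if value in features:
            return doc[value]
        return render_template(value, doc)
    # Callable resolution: a function loaded via !function in the task YAML.
    if callable(value):
        if lmms_eval_specific_kwargs is not None:
            return value(doc, lmms_eval_specific_kwargs)
        return value(doc)
    raise TypeError(f"Unsupported config value type: {type(value).__name__}")
```

The check order matters in this sketch. A string that happens to equal a column name is never rendered as a template, which mirrors the column-first rule described above.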
The ChatMessage protocol (defined in lmms_eval/protocol.py) provides the structured message format used by doc_to_messages. Messages consist of a role ("user", "system", "assistant") and a list of content items, each typed as text, image, video, or audio. The ChatMessages class provides methods to convert between different message formats: to_hf_messages() for HuggingFace-style messages, to_openai_messages() for OpenAI API format, and extract_media() to separate media URLs from the message stream.
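The sketch below makes the message shape concrete without depending on lmms-eval. It mirrors a message (a role plus typed content items) with plain dataclasses and separates media by type, roughly what extract_media() is described as doing. Content, Message, and split_media are hypothetical names. The real classes are pydantic models in lmms_eval/protocol.py, and the (images, videos, audios) tuple order used here is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List, Literal, Tuple

ContentType = Literal["text", "image", "video", "audio"]


@dataclass
class Content:
    """Hypothetical mirror of a typed content item."""
    type: ContentType
    value: Any  # the text for "text"; a URL, path, or in-memory object for media


@dataclass
class Message:
    """Hypothetical mirror of ChatMessage: a role plus typed content items."""
    role: Literal["user", "system", "assistant"]
    content: List[Content] = field(default_factory=list)


def split_media(messages: List[Message]) -> Tuple[list, list, list]:
    """Collect image, video, and audio items in message order; text is skipped."""
    images, videos, audios = [], [], []
    buckets = {"image": images, "video": videos, "audio": audios}
    for message in messages:
        for item in message.content:
            if item.type in buckets:
                buckets[item.type].append(item.value)
    return images, videos, audios
```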
Usage
Use these interfaces when implementing a custom task's utils.py. Each function you write must conform to the expected signature. Reference them in your YAML with !function utils.function_name. The framework will call your functions automatically during the evaluation loop.
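Before it can be called, a !function utils.function_name reference has to be turned into a Python callable. One plausible way to do that with the standard library is sketched below. It assumes utils.py sits in the same directory as the task YAML. load_function is a hypothetical helper, not the framework's own loader.

```python
import importlib.util
import os
from typing import Callable


def load_function(reference: str, yaml_dir: str) -> Callable:
    """Resolve e.g. 'utils.my_doc_to_text' to the function defined in <yaml_dir>/utils.py."""
    module_name, func_name = reference.rsplit(".", 1)
    module_path = os.path.join(yaml_dir, f"{module_name}.py")
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes utils.py so its functions are defined
    return getattr(module, func_name)
```

Because the module is loaded by file path rather than by package import, the task directory does not need to be on sys.path in this sketch.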
Code Reference
Source Location
- Repository: lmms-eval
- Files: lmms_eval/api/task.py (lines 1324-1414: doc_to_text, doc_to_target, doc_to_visual); lmms_eval/protocol.py (lines 17-178: ChatMessage protocol)
Signature
```python
from typing import List, Literal, Tuple, Union

from pydantic import BaseModel


# In ConfigurableTask (lmms_eval/api/task.py); signatures only:
class ConfigurableTask(Task):
    def doc_to_text(self, doc: dict) -> Union[int, str]:
        ...

    def doc_to_target(self, doc: dict) -> Union[int, str, list]:
        ...

    def doc_to_visual(self, doc: dict) -> Union[int, str, list]:
        ...


# In protocol.py:
class ChatMessage(BaseModel):
    role: Literal["user", "system", "assistant"]
    content: List[ChatContent]


class ChatMessages(BaseModel):
    messages: List[ChatMessage]

    def extract_media(self) -> Tuple[list, list, list]:
        ...

    def to_hf_messages(self, video_kwargs=None) -> list:
        ...

    def to_openai_messages(self, video_kwargs=None) -> list:
        ...
```
Import
```python
# For implementing utility functions, no special import is needed.
# Your utils.py is automatically imported by the framework.

# For using the ChatMessage protocol:
from lmms_eval.protocol import ChatMessage, ChatMessages
from lmms_eval.protocol import ChatTextContent, ChatImageContent
from lmms_eval.protocol import ChatVideoContent, ChatAudioContent
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | A single dataset document represented as a dictionary. Keys are dataset column names; values are the row's values for those columns. For multimodal datasets, image columns contain PIL Image objects and video columns contain file paths. |
| lmms_eval_specific_kwargs | Optional[dict] | No | Model-specific keyword arguments (e.g., pre_prompt, post_prompt) configured in YAML under lmms_eval_specific_kwargs. Passed as the second argument to doc_to_text and doc_to_visual when available. |
| results | List[str] | Yes (for process_results) | A list of model output strings. Typically contains a single element for non-multi-round tasks. |
Outputs
| Name | Type | Description |
|---|---|---|
| doc_to_visual return | List[Union[PIL.Image.Image, str]] | A list of visual media items (PIL Images for image tasks, file path strings for video tasks). Always a list, even for single-image tasks. |
| doc_to_text return | str | The text prompt string to send to the model. |
| doc_to_messages return | List[ChatMessage] | A list of structured chat messages with typed content (text, image, video, audio). |
| process_results return | Dict[str, Any] | A dictionary mapping metric names to their per-sample values. Keys must match metric names in the task's metric_list. |
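The returned keys must match the metric names in metric_list, so a process_results function that reports several metrics simply returns several keys. The sketch below assumes a hypothetical task whose metric_list names exact_match and contains_answer. It also adds a hypothetical check_result_keys helper that can catch a mismatched key in a unit test before a full evaluation run. How the framework itself treats unexpected keys is not covered here.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical metric names, assumed to mirror the task YAML's metric_list.
METRIC_NAMES = ["exact_match", "contains_answer"]


def my_process_results(doc: dict, results: List[str]) -> Dict[str, Any]:
    """Return one per-sample value for each metric named in metric_list."""
    pred = results[0].strip().lower()
    target = str(doc["answer"]).strip().lower()
    return {
        "exact_match": 1.0 if pred == target else 0.0,
        # Naive substring match; a looser signal than exact_match.
        "contains_answer": 1.0 if target in pred else 0.0,
    }


def check_result_keys(result: Dict[str, Any], metric_names: Iterable[str]) -> None:
    """Raise if process_results returned keys that no configured metric uses."""
    unknown = set(result) - set(metric_names)
    if unknown:
        raise KeyError(f"Keys not in metric_list: {sorted(unknown)}")
```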
Usage Examples
Basic Example
```python
# utils.py for a simple VQA task
def my_doc_to_visual(doc):
    """Extract the image from the document."""
    return [doc["image"].convert("RGB")]


def my_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Construct the prompt from the question field."""
    question = doc["question"]
    if lmms_eval_specific_kwargs:
        pre = lmms_eval_specific_kwargs.get("pre_prompt", "")
        post = lmms_eval_specific_kwargs.get("post_prompt", "")
        return f"{pre}{question}{post}"
    return question


def my_process_results(doc, results):
    """Score the model output against the ground truth."""
    pred = results[0].strip().lower()
    target = doc["answer"].strip().lower()
    score = 1.0 if pred == target else 0.0
    return {"exact_match": score}
```
With ChatMessage Protocol
```python
# utils.py for a chat-based task using doc_to_messages
from lmms_eval.protocol import (
    ChatMessage,
    ChatTextContent,
    ChatImageContent,
)


def my_doc_to_messages(doc):
    """Build a structured chat message with image and text."""
    return [
        ChatMessage(
            role="user",
            content=[
                ChatImageContent(url=doc["image"]),
                ChatTextContent(text=doc["question"]),
            ],
        )
    ]
```
MME Example (Real-World)
```python
# Adapted from lmms_eval/tasks/mme/utils.py.
# parse_pred_ans and PERCEPTION_CATEGORIES are module-level helpers not shown here.
def mme_doc_to_visual(doc):
    return [doc["image"].convert("RGB")]


def mme_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    # Treat the default None as "no kwargs" so the membership tests below cannot raise TypeError.
    lmms_eval_specific_kwargs = lmms_eval_specific_kwargs or {}
    question = doc["question"].strip()
    if "pre_prompt" in lmms_eval_specific_kwargs:
        question = f"{lmms_eval_specific_kwargs['pre_prompt']}{question}"
    if "post_prompt" in lmms_eval_specific_kwargs:
        question = f"{question}{lmms_eval_specific_kwargs['post_prompt']}"
    return question


def mme_process_results(doc, results):
    pred = results[0]
    pred_ans = parse_pred_ans(pred)  # maps the raw output to an answer comparable with gt_ans
    gt_ans = doc["answer"].lower().strip().replace(".", "")
    score = 1.0 if pred_ans == gt_ans else 0.0
    category = doc["category"]
    # The key selects which metric (and hence which aggregation) receives this sample.
    key = "mme_perception_score" if category in PERCEPTION_CATEGORIES else "mme_cognition_score"
    return {key: {"question_id": doc["question_id"], "category": category, "score": score}}
```
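mme_process_results returns a dict per sample rather than a bare float, so a downstream aggregation function can group scores by category. The sketch below shows that pattern with two hypothetical helpers, assuming yes/no answers:

- parse_yes_no is a simple stand-in for parse_pred_ans that keeps only the first word of the output.
- mean_score_by_category averages the per-sample dicts within each category.

MME's actual parsing and aggregation are not shown in this section and may compute scores differently.

```python
import re
from collections import defaultdict
from typing import Dict, List


def parse_yes_no(pred: str) -> str:
    """Hypothetical stand-in for parse_pred_ans: first word if yes/no, else 'other'."""
    first_word = re.split(r"\W+", pred.strip().lower(), maxsplit=1)[0]
    return first_word if first_word in ("yes", "no") else "other"


def mean_score_by_category(samples: List[dict]) -> Dict[str, float]:
    """Average the per-sample 'score' values within each 'category'."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for sample in samples:
        totals[sample["category"]] += sample["score"]
        counts[sample["category"]] += 1
    return {category: totals[category] / counts[category] for category in totals}
```

Looking only at the first word means an output like "Not sure" is parsed as "other" rather than "no". A prefix-substring heuristic would misread it as "no".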