Principle: EvolvingLMMs-Lab lmms-eval YAML Task Configuration
| Knowledge Sources | |
|---|---|
| Domains | Configuration, Task_Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation tasks should be specified declaratively through structured configuration files rather than through imperative code, enabling rapid task creation without modifying framework internals.
Description
Declarative configuration is a central design principle of lmms-eval. Rather than requiring users to write Python classes for every new benchmark, the framework allows tasks to be defined entirely through YAML files. A single YAML file specifies everything the framework needs to know: where to find the data, how to construct prompts, what output format to expect, and how to score results.
This approach has several advantages:
Accessibility: Researchers who are not framework developers can add new benchmarks by writing YAML and, optionally, small utility functions. No understanding of the evaluation loop internals is required.
Composability: YAML configurations support template inheritance via the include directive. A base template can define shared settings (generation parameters, common metric configurations), and individual task YAMLs can override only the fields that differ. This reduces duplication across related benchmarks.
Transparency: Because the configuration is a flat, human-readable file, it is easy to review, version-control, and share. The exact evaluation protocol for any task can be understood by reading its YAML.
The YAML configuration maps directly to the TaskConfig dataclass, which defines all supported fields. The most important fields fall into several categories:
Data fields: dataset_path, dataset_name, dataset_kwargs, test_split, validation_split, training_split, fewshot_split.
Prompt fields: doc_to_text, doc_to_visual, doc_to_target, doc_to_choice, doc_to_messages. These can be column names (strings), Jinja2 templates, or !function references to Python callables.
Output and generation fields: output_type (one of "generate_until", "loglikelihood", "multiple_choice", "generate_until_multi_round"), generation_kwargs (temperature, max tokens, etc.).
Metric fields: metric_list (a list of metric configurations), process_results (a custom result processing function).
Model-specific fields: lmms_eval_specific_kwargs, model_specific_generation_kwargs, model_specific_target_kwargs for per-model overrides.
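To illustrate how these field categories combine, here is a minimal sketch of a task YAML. The task name, dataset path, prompt template, and utility function names are hypothetical; actual values depend on the benchmark being defined:

```yaml
# Hypothetical task configuration; field names follow TaskConfig,
# but the dataset and utility names are illustrative only.
task: my_vqa_benchmark
dataset_path: example-org/my-vqa-dataset
test_split: test
output_type: generate_until
doc_to_visual: !function utils.doc_to_visual
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 64
  temperature: 0.0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```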
The !function YAML tag is a custom constructor that resolves a string like utils.my_function to the actual Python callable at load time by importing from the task's companion utils.py module.
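The resolution pattern behind the tag can be sketched as follows. This is a simplified illustration, not lmms-eval's actual constructor (which also handles importing from the task's own directory); `resolve_function` is a hypothetical name:

```python
import importlib

def resolve_function(tag_value):
    """Resolve a '!function module.attr' style reference to a callable.

    Sketch of the pattern: split the string into a module path and an
    attribute name, import the module, and fetch the attribute.
    """
    module_name, func_name = tag_value.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)
```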
Usage
Use YAML task configuration whenever you create a new evaluation task. Start by identifying the closest existing task YAML as a template, copy it into a new task directory, and modify the fields to match your new benchmark. For benchmarks that require custom prompt construction or result processing, implement the necessary functions in a utils.py file and reference them with !function directives.
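A companion utils.py typically contains small, pure functions keyed to the dataset's schema. The sketch below assumes hypothetical document fields ("question", "answer", "image"); real signatures and fields vary by benchmark:

```python
# Sketch of a task's companion utils.py; field names are illustrative.

def doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Build the textual prompt for one document."""
    return f"Question: {doc['question']}\nAnswer:"

def doc_to_visual(doc):
    """Return the list of images associated with one document."""
    return [doc["image"]]

def process_results(doc, results):
    """Score a model response against the gold answer (exact match)."""
    prediction = results[0].strip().lower()
    target = doc["answer"].strip().lower()
    return {"exact_match": 1.0 if prediction == target else 0.0}
```

These functions are then referenced from the task YAML, e.g. `doc_to_text: !function utils.doc_to_text`.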
Theoretical Basis
The YAML configuration system implements a mapping from declarative specification to an executable task object:
YAML File --> TaskConfig Dataclass --> ConfigurableTask Instance
The TaskConfig dataclass defines the schema:
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class TaskConfig(dict):
    task: str = None
    dataset_path: str = None
    dataset_name: str = None
    output_type: str = "generate_until"
    doc_to_text: Union[Callable, str] = None
    doc_to_visual: Union[Callable, str] = None
    doc_to_target: Union[Callable, str] = None
    doc_to_messages: Callable = None
    process_results: Union[Callable, str] = None
    metric_list: list = None
    generation_kwargs: dict = None
    # ... additional fields
The resolution of !function references follows the pattern:
"!function utils.my_func" --> import utils from task directory --> getattr(utils, "my_func")
Template inheritance works through the include key:
child_config = merge(load(include_path), child_yaml_fields)
where child fields take precedence over included base fields, following a last-writer-wins merge strategy.
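The last-writer-wins semantics can be demonstrated with a shallow merge. This is a sketch of the strategy only; a real loader may merge nested mappings such as generation_kwargs differently, and `merge_configs` is a hypothetical name:

```python
def merge_configs(base, child):
    """Last-writer-wins merge: child fields override base fields.

    Shallow-merge sketch; nested dicts from the child replace the
    base's wholesale rather than being merged recursively.
    """
    merged = dict(base)
    merged.update(child)
    return merged

# Base template shared by related tasks, overridden per task.
base = {"output_type": "generate_until",
        "generation_kwargs": {"max_new_tokens": 32}}
child = {"task": "my_task",
         "generation_kwargs": {"max_new_tokens": 128}}
config = merge_configs(base, child)
```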