Workflow:EvolvingLMMs Lab Lmms eval Custom Task Creation

Knowledge Sources	lmms-eval Task Guide
Domains	LLMs, Multimodal_Evaluation, Benchmarking
Last Updated	2026-02-14 00:00 GMT

Overview

End-to-end process for creating a new evaluation task in lmms-eval, from dataset preparation through YAML configuration, utility function implementation, and metric definition.

Description

This workflow guides the creation of a custom evaluation task that integrates with the lmms-eval framework. Tasks are defined through YAML configuration files that specify the dataset source, prompt construction functions, output type, generation parameters, and metric definitions. A companion utils.py file implements the Python functions referenced by the YAML config (doc_to_visual, doc_to_text, doc_to_messages, process_results, and metric aggregation). The task is automatically discovered by the TaskManager through recursive YAML scanning of the tasks directory.

Usage

Execute this workflow when you need to evaluate models on a dataset or benchmark that is not already supported by lmms-eval. This applies when introducing a new academic benchmark, creating a custom evaluation suite for a specific domain, or adapting an existing task with different prompting strategies or metrics.

Execution Steps

Step 1: Dataset Preparation

Prepare or identify the evaluation dataset. The dataset should be hosted on HuggingFace Hub (preferred) or available locally as JSON/CSV files. Each sample should contain the input data (text, images, video, audio) and ground-truth annotations. Define which dataset split to use for evaluation (typically test or validation) and optionally which split to use for few-shot examples.

Key considerations:

HuggingFace Hub datasets are loaded via datasets.load_dataset with dataset_path and dataset_name
Local datasets can be specified through dataset_kwargs (data_files, data_dir)
If the dataset requires authentication, set token: True in dataset_kwargs
Consider using process_docs to preprocess or filter the dataset before evaluation

Step 2: Task Directory Creation

Create a new directory under lmms_eval/tasks/ for the task. The directory should contain at minimum a YAML configuration file and a utils.py file with the task-specific Python functions. For tasks with multiple subtask variants (e.g., different splits or prompting styles), create a group YAML that aggregates them and individual YAML files per variant using a shared default template.

Key considerations:

Directory name should match the task identifier
Use _default_template_yaml for shared configuration across subtask variants
Group YAMLs aggregate subtasks under a single name for convenient invocation
Include a README.md to document the benchmark and citation

Step 3: YAML Configuration

Define the task configuration in YAML format. This specifies the dataset source, output type (generate_until, loglikelihood, or multiple_choice), prompt construction functions, generation parameters, and metric list. The YAML references Python functions from utils.py using the !function directive. Template inheritance via the include directive reduces duplication across subtask variants.

Key considerations:

output_type determines how the model is queried: generation, log-likelihood scoring, or multiple choice
doc_to_messages is the preferred prompt format (structured messages with roles); doc_to_visual + doc_to_text is the legacy format
generation_kwargs control max_new_tokens, temperature, top_p, and other generation parameters
lmms_eval_specific_kwargs allows model-specific prompt variations (e.g., different post_prompt for different model families)

Step 4: Utility Functions Implementation

Implement the Python functions referenced by the YAML configuration in utils.py. These include: doc_to_visual (extracts visual inputs from a dataset sample), doc_to_text (formats the text prompt), doc_to_messages (creates structured chat messages with interleaved media), doc_to_target (extracts the ground-truth answer), process_results (parses model output and computes per-sample metrics), and aggregation functions (compute final scores from per-sample results).

Key considerations:

process_results runs in parallel across GPUs; use it for per-sample scoring and external judge calls (e.g., GPT-4)
Aggregation functions run only on rank 0; use them for final score computation
Return a dictionary from process_results where keys match metric names in metric_list
For GPT-as-judge evaluation, implement API calls with retry logic in process_results

Step 5: Metric Definition

Define how model outputs are scored and aggregated. Standard metrics (acc, exact_match, BLEU, F1) are available in lmms_eval/api/metrics.py. Custom metrics are defined by specifying a metric name in metric_list, implementing a corresponding aggregation function in utils.py, and returning matching keys from process_results. The higher_is_better flag indicates metric directionality.

Key considerations:

Standard metrics like acc and exact_match have predefined aggregation
Custom metrics require both a process_results return key and an aggregation function
Multiple metrics can be defined per task for multi-dimensional evaluation
Bootstrap confidence intervals are computed automatically for all metrics

Step 6: Testing and Validation

Verify the task works correctly by running a limited evaluation. Use --limit to test with a small number of samples and --log_samples to inspect model outputs. Check that prompts are formatted correctly, metrics are computed as expected, and results match known baselines if available. The --check_integrity flag runs task-specific tests.

Key considerations:

Always test with --limit 8 first to catch configuration errors
Inspect logged samples to verify prompt formatting and answer extraction
Compare results against published baselines when available
Test with multiple model types to ensure compatibility with both chat and simple interfaces

Execution Diagram

GitHub URL

Workflow Repository