Workflow:EvolvingLMMs Lab Lmms eval Custom Task Creation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal_Evaluation, Benchmarking |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for creating a new evaluation task in lmms-eval, from dataset preparation through YAML configuration, utility function implementation, and metric definition.
Description
This workflow guides the creation of a custom evaluation task that integrates with the lmms-eval framework. Tasks are defined through YAML configuration files that specify the dataset source, prompt construction functions, output type, generation parameters, and metric definitions. A companion utils.py file implements the Python functions referenced by the YAML config (doc_to_visual, doc_to_text, doc_to_messages, process_results, and metric aggregation). The task is automatically discovered by the TaskManager through recursive YAML scanning of the tasks directory.
Usage
Execute this workflow when you need to evaluate models on a dataset or benchmark that is not already supported by lmms-eval. This applies when introducing a new academic benchmark, creating a custom evaluation suite for a specific domain, or adapting an existing task with different prompting strategies or metrics.
Execution Steps
Step 1: Dataset Preparation
Prepare or identify the evaluation dataset. The dataset should be hosted on HuggingFace Hub (preferred) or available locally as JSON/CSV files. Each sample should contain the input data (text, images, video, audio) and ground-truth annotations. Define which dataset split to use for evaluation (typically test or validation) and optionally which split to use for few-shot examples.
Key considerations:
- HuggingFace Hub datasets are loaded via datasets.load_dataset with dataset_path and dataset_name
- Local datasets can be specified through dataset_kwargs (data_files, data_dir)
- If the dataset requires authentication, set token: True in dataset_kwargs
- Consider using process_docs to preprocess or filter the dataset before evaluation
Step 2: Task Directory Creation
Create a new directory under lmms_eval/tasks/ for the task. The directory should contain at minimum a YAML configuration file and a utils.py file with the task-specific Python functions. For tasks with multiple subtask variants (e.g., different splits or prompting styles), create a group YAML that aggregates them and individual YAML files per variant using a shared default template.
Key considerations:
- Directory name should match the task identifier
- Use _default_template_yaml for shared configuration across subtask variants
- Group YAMLs aggregate subtasks under a single name for convenient invocation
- Include a README.md to document the benchmark and citation
Step 3: YAML Configuration
Define the task configuration in YAML format. This specifies the dataset source, output type (generate_until, loglikelihood, or multiple_choice), prompt construction functions, generation parameters, and metric list. The YAML references Python functions from utils.py using the !function directive. Template inheritance via the include directive reduces duplication across subtask variants.
Key considerations:
- output_type determines how the model is queried: generation, log-likelihood scoring, or multiple choice
- doc_to_messages is the preferred prompt format (structured messages with roles); doc_to_visual + doc_to_text is the legacy format
- generation_kwargs control max_new_tokens, temperature, top_p, and other generation parameters
- lmms_eval_specific_kwargs allows model-specific prompt variations (e.g., different post_prompt for different model families)
Step 4: Utility Functions Implementation
Implement the Python functions referenced by the YAML configuration in utils.py. These include: doc_to_visual (extracts visual inputs from a dataset sample), doc_to_text (formats the text prompt), doc_to_messages (creates structured chat messages with interleaved media), doc_to_target (extracts the ground-truth answer), process_results (parses model output and computes per-sample metrics), and aggregation functions (compute final scores from per-sample results).
Key considerations:
- process_results runs in parallel across GPUs; use it for per-sample scoring and external judge calls (e.g., GPT-4)
- Aggregation functions run only on rank 0; use them for final score computation
- Return a dictionary from process_results where keys match metric names in metric_list
- For GPT-as-judge evaluation, implement API calls with retry logic in process_results
Step 5: Metric Definition
Define how model outputs are scored and aggregated. Standard metrics (acc, exact_match, BLEU, F1) are available in lmms_eval/api/metrics.py. Custom metrics are defined by specifying a metric name in metric_list, implementing a corresponding aggregation function in utils.py, and returning matching keys from process_results. The higher_is_better flag indicates metric directionality.
Key considerations:
- Standard metrics like acc and exact_match have predefined aggregation
- Custom metrics require both a process_results return key and an aggregation function
- Multiple metrics can be defined per task for multi-dimensional evaluation
- Bootstrap confidence intervals are computed automatically for all metrics
Step 6: Testing and Validation
Verify the task works correctly by running a limited evaluation. Use --limit to test with a small number of samples and --log_samples to inspect model outputs. Check that prompts are formatted correctly, metrics are computed as expected, and results match known baselines if available. The --check_integrity flag runs task-specific tests.
Key considerations:
- Always test with --limit 8 first to catch configuration errors
- Inspect logged samples to verify prompt formatting and answer extraction
- Compare results against published baselines when available
- Test with multiple model types to ensure compatibility with both chat and simple interfaces