Principle:EvolvingLMMs Lab Lmms eval Task Testing
| Knowledge Sources | |
|---|---|
| Domains | Testing, Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Custom evaluation tasks must be validated through limited test runs with sample logging before being used for full-scale evaluation.
Description
Creating a new evaluation task involves multiple configuration points: dataset loading, prompt construction, media extraction, metric computation, and aggregation. A misconfiguration at any of these points can produce silently incorrect results, so a disciplined testing workflow is essential.
Task testing follows a progressive validation strategy:
1. Smoke test with --limit: Run the evaluation on a small number of samples (e.g., 8) to verify that the entire pipeline executes without errors. This catches configuration issues such as missing dataset columns, incorrect function references, type mismatches in doc_to_text output, or missing metric registrations. The --limit flag restricts evaluation to the specified number of examples, enabling rapid iteration.
2. Sample inspection with --log_samples: Enable per-sample logging to examine the exact inputs sent to the model and the raw outputs received. This verifies that prompts are correctly constructed, images are properly extracted, and generation parameters produce reasonable outputs. The logged samples are saved as JSON files that can be inspected manually.
3. Metric validation: After a limited run completes successfully, examine the reported metrics to ensure they are in the expected range and that aggregation is working correctly. For tasks with custom process_results and aggregation functions, this step is particularly important.
4. Predict-only mode: For expensive evaluations or when debugging prompt formatting, use --predict_only to generate model outputs without computing metrics. This is useful for inspecting raw model behavior before investing in metric implementation.
5. Full evaluation: Once the task passes all validation steps, run without --limit for the complete benchmark evaluation.
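The progression above can be sketched as a small helper that assembles the command line for each stage. This is only an illustration: the flags `--limit`, `--log_samples`, and `--predict_only` come from the steps above, while the `python -m lmms_eval` entry point, the `--model`, `--tasks`, and `--output_path` arguments, and the model and task names are assumptions to be replaced with whatever your installation actually uses.

```python
import shlex

# Hypothetical placeholders: substitute your own model and task names.
MODEL = "my_model"
TASK = "my_custom_task"


def build_command(stage, limit=8, output_path="./logs/"):
    """Return the argv list for one validation stage.

    stage is one of "smoke", "predict_only", or "full".
    """
    # Assumed base invocation; adjust to match your installation.
    cmd = [
        "python", "-m", "lmms_eval",
        "--model", MODEL,
        "--tasks", TASK,
        "--output_path", output_path,
    ]
    if stage == "smoke":
        # Steps 1 and 2: few samples, with per-sample logging.
        cmd += ["--limit", str(limit), "--log_samples"]
    elif stage == "predict_only":
        # Step 4: generate outputs without computing metrics.
        cmd += ["--limit", str(limit), "--log_samples", "--predict_only"]
    elif stage == "full":
        # Step 5: no --limit, so the whole benchmark is evaluated.
        pass
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return cmd


# Print the commands so they can be reviewed before anything is executed.
for stage in ("smoke", "predict_only", "full"):
    print(shlex.join(build_command(stage)))
```

Printing the commands rather than executing them keeps the sketch side-effect free; pass the returned list to `subprocess.run` once the arguments look right.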
The same principle is built into the framework itself: at initialization time it validates the task configuration by processing a test document through doc_to_text, doc_to_target, and optionally doc_to_choice, catching issues before the evaluation loop begins.
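A similar check can be run by hand while a task is still being written. The sketch below is not the framework's own code; it is a standalone approximation that calls the three functions on one document and reports unexpected return types, using the types listed under Level 1 later in this page. The function name `check_test_document` and the example document fields are hypothetical.

```python
def check_test_document(test_doc, doc_to_text, doc_to_target, doc_to_choice=None):
    """Run one document through the task callables and return a list of problems."""
    problems = []

    def run(name, fn, allowed):
        # Record exceptions (e.g. a missing dataset column) instead of crashing.
        try:
            value = fn(test_doc)
        except Exception as exc:
            problems.append(f"{name} raised {type(exc).__name__}: {exc}")
            return
        if not isinstance(value, allowed):
            expected = ", ".join(t.__name__ for t in allowed)
            problems.append(
                f"{name} returned {type(value).__name__}, expected one of: {expected}"
            )

    run("doc_to_text", doc_to_text, (str, int))
    run("doc_to_target", doc_to_target, (str, int, list))
    if doc_to_choice is not None:
        run("doc_to_choice", doc_to_choice, (list,))
    return problems


# Example with a made-up document; the field names are illustrative only.
doc = {"question": "What is shown in the image?", "answer": "a cat"}
print(check_test_document(doc, lambda d: d["question"], lambda d: d["answer"]))
print(check_test_document(doc, lambda d: d["prompt"], lambda d: d["answer"]))
```

The second call references a column that does not exist, so the helper reports the resulting `KeyError` as a problem string rather than raising.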
Usage
Use task testing every time you create or modify a custom task. Always start with a --limit 8 --log_samples run to verify correctness before running a full evaluation. While metric logic is still under development, use --predict_only to iterate on prompt design independently of the metric implementation. Inspect the JSON output files to verify that prompts, model outputs, and scores all look correct.
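Manual inspection of the logged samples is easier with a small viewer. The sketch below makes no assumption about which fields the log contains: it prints whatever keys each record has, truncated for readability. It accepts either a single JSON document or one JSON object per line, since the exact file layout may differ between versions; each record is assumed to be a JSON object. The function names and the path in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def load_samples(path):
    """Load logged samples from a JSON file or a JSON-lines file."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: fall back to one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def preview(samples, n=3, width=200):
    """Print every field of the first n samples, truncated to width characters."""
    for i, sample in enumerate(samples[:n]):
        print(f"--- sample {i} ---")
        for key, value in sample.items():
            print(f"{key}: {str(value)[:width]}")


# Usage (the path is a placeholder for a file written by a --log_samples run):
# preview(load_samples("logs/your_samples_file.json"))
```

Truncating long values keeps base64-encoded media or long generations from flooding the terminal while still showing enough of each prompt and response to spot formatting mistakes.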
Theoretical Basis
Task validation can be modeled as a sequence of increasingly comprehensive checks:
Level 0: Static validation
- YAML parses correctly
- !function references resolve to callable objects
- output_type is in {loglikelihood, multiple_choice, generate_until, generate_until_multi_round}
- metric names are registered or provided via process_results
Level 1: Document-level validation (during ConfigurableTask.__init__)
- doc_to_text(test_doc) returns str or int
- doc_to_target(test_doc) returns str, int, or list
- doc_to_choice(test_doc) returns list (if configured)
- Whitespace consistency between target_delimiter and choices
Level 2: Pipeline validation (--limit N)
- Dataset downloads and loads successfully
- All N documents process without exceptions
- Model generates output for all requests
- process_results produces expected metric keys
- Aggregation functions return numeric scores
Level 3: Quality validation (human inspection of --log_samples output)
- Prompts contain expected content
- Visual inputs are correct images/videos
- Model outputs are reasonable
- Metric scores are in expected range
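The Level 0 checks can be approximated without loading a dataset or a model. The sketch below operates on an already-parsed task configuration represented as a plain dictionary; it is an illustration, not the framework's validator. The allowed output types are taken from the list above. The `metric_list` layout (a list of entries with a `metric` name) and the `registered_metrics` argument, which stands in for the framework's metric registry, are assumptions.

```python
ALLOWED_OUTPUT_TYPES = {
    "loglikelihood",
    "multiple_choice",
    "generate_until",
    "generate_until_multi_round",
}


def static_check(config, function_keys=("process_results",), registered_metrics=frozenset()):
    """Return a list of problems found in a parsed task config dictionary."""
    problems = []

    output_type = config.get("output_type")
    if output_type not in ALLOWED_OUTPUT_TYPES:
        problems.append(
            f"output_type {output_type!r} is not one of {sorted(ALLOWED_OUTPUT_TYPES)}"
        )

    # Keys the author declared with !function should now hold callables.
    for key in function_keys:
        if key in config and not callable(config[key]):
            problems.append(f"{key} did not resolve to a callable")

    # A metric must be registered, or be produced by a custom process_results.
    has_process_results = callable(config.get("process_results"))
    for entry in config.get("metric_list", []):
        name = entry.get("metric")
        if name not in registered_metrics and not has_process_results:
            problems.append(
                f"metric {name!r} is neither registered nor provided via process_results"
            )
    return problems


# Example with a deliberately misspelled output_type and an unknown metric.
example = {"output_type": "generate_untill", "metric_list": [{"metric": "my_score"}]}
for problem in static_check(example):
    print(problem)
```

Returning a list of messages instead of raising on the first problem lets one pass surface every static error at once.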
Cost increases at each level, which is why the progressive strategy matters: catch simple configuration errors cheaply (Levels 0 and 1) before investing in model inference (Level 2) or human review (Level 3).
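The Level 2 checks on process_results and aggregation can also be exercised offline on a handful of hand-written results. The sketch below assumes that process_results yields one dictionary of metric values per document and that each aggregation function maps a list of per-document values to a single score. It also assumes scores should fall in the range 0 to 1, which will not hold for every metric. The function name `check_metrics` is hypothetical.

```python
import numbers


def check_metrics(per_doc_results, aggregations, valid_range=(0.0, 1.0)):
    """Validate per-document metric dicts and their aggregated scores.

    per_doc_results: list of dicts, one per document (assumed shape).
    aggregations: dict mapping metric name to a function over a list of values.
    Returns (scores, problems).
    """
    problems = []
    scores = {}
    expected = set(aggregations)

    # Every document should report every expected metric key.
    for i, result in enumerate(per_doc_results):
        missing = expected - set(result)
        if missing:
            problems.append(f"doc {i} is missing metric keys {sorted(missing)}")

    low, high = valid_range
    for name, aggregate in aggregations.items():
        values = [r[name] for r in per_doc_results if name in r]
        if not values:
            problems.append(f"no values collected for metric {name!r}")
            continue
        score = aggregate(values)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, numbers.Real):
            problems.append(f"aggregation for {name!r} returned {type(score).__name__}")
            continue
        scores[name] = score
        if not low <= score <= high:
            problems.append(f"{name} = {score} is outside [{low}, {high}]")
    return scores, problems


def mean(values):
    return sum(values) / len(values)


scores, problems = check_metrics([{"acc": 1.0}, {"acc": 0.0}], {"acc": mean})
print(scores, problems)
```

A score that comes back as a string, a dictionary, or NaN is reported as a problem here rather than being written into a results table unnoticed.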