Heuristic:EvolvingLMMs Lab Lmms eval Limit Flag Testing Only
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Best_Practices |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The --limit CLI flag should be used only for testing and debugging, never for computing real benchmark metrics.
Description
The --limit flag restricts the number of evaluation instances processed, allowing quick smoke tests of model + task configurations. However, metrics computed on limited data are statistically unreliable and non-representative of true model performance. The framework explicitly warns users about this with a prominently displayed warning message. Despite this, it remains a common mistake for users to report metrics obtained with --limit as official benchmark results.
Usage
Use this heuristic as a best practice reminder whenever setting up evaluation runs. Apply --limit only during development/debugging to verify that the pipeline runs end-to-end. Remove it for any evaluation where metrics will be reported or compared.
The Insight (Rule of Thumb)
- Action: Use --limit N only for pipeline testing. Remove for real benchmarks.
- Value: Common test values: --limit 8 or --limit 16 for quick sanity checks.
- Trade-off: Fast iteration during development vs. invalid metrics if accidentally left in production runs.
- Related: --predict_only forces log_samples=True and requires --output_path.
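One way to enforce the rule above is to assemble the command line in a small wrapper that refuses --limit whenever the run is meant to be reported. The sketch below is illustrative only: `build_eval_command` and its `for_report` parameter are hypothetical and not part of lmms_eval, and the model name, task name, and output paths are placeholder values.

```python
import shlex


def build_eval_command(model, tasks, output_path, limit=None, for_report=False):
    """Assemble an lmms_eval command line (hypothetical helper, not part of lmms_eval).

    Raises ValueError if a limit is requested for a run whose metrics will be reported.
    """
    if for_report and limit is not None:
        raise ValueError("--limit is for smoke tests only; remove it before reporting metrics")
    cmd = ["python", "-m", "lmms_eval",
           "--model", model,
           "--tasks", tasks,
           "--output_path", output_path]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


# Placeholder model/task names; substitute the configuration under test.
smoke = build_eval_command("llava", "mme", "./logs/smoke", limit=8)
full = build_eval_command("llava", "mme", "./logs/full", for_report=True)

print(shlex.join(smoke))
print(shlex.join(full))
```

Keeping the smoke-test and reporting commands behind one function makes it harder for a leftover --limit to slip into a reported run.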
Reasoning
Benchmarks are designed with specific dataset sizes and distributions. Evaluating on a small subset introduces sampling bias and reduces statistical power. For example, a task with 1,000 instances evaluated on only 8 would have very wide confidence intervals, and its score could differ widely depending on which 8 instances are selected. The warning in the code emphasizes that this is a testing-only feature.
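To put rough numbers on the confidence-interval point, the sketch below computes the half-width of a normal-approximation 95% interval for an accuracy metric. It assumes, optimistically, that the limited subset behaves like a random sample of the benchmark and that the true accuracy is 0.5 (the worst case for variance). Both are assumptions made for illustration, and the normal approximation is itself crude at n = 8. `accuracy_margin` is a hypothetical helper.

```python
import math


def accuracy_margin(p, n, z=1.96):
    """Half-width of the normal-approximation interval for accuracy p measured on n items."""
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (8, 16, 1000):
    margin = 100 * accuracy_margin(0.5, n)
    print(f"n={n:5d}  margin=+/-{margin:.1f} points")
# n=    8  margin=+/-34.6 points
# n=   16  margin=+/-24.5 points
# n= 1000  margin=+/-3.1 points
```

Under these assumptions, a score measured with --limit 8 is uncertain by roughly 35 percentage points in either direction, which is too coarse to separate most models.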
Additionally, --limit interacts with request caching. When caching is enabled and either no cached instances exist yet or the cache is being rewritten, the framework temporarily ignores the limit and processes all documents to build a complete cache, then applies the limit afterward. This prevents incomplete caches.
Code evidence from lmms_eval/__main__.py:590-591:
if args.limit:
    eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING."
                        "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")
Cache interaction from lmms_eval/api/task.py:432-434:
# process all documents when caching is specified for simplicity
if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
    limit = None
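The condition above can be isolated into a small function to see which flag combinations drop the limit. This is a sketch for illustration: `effective_limit` is a hypothetical helper that copies only the quoted condition, and it does not model the later step where the limit is reapplied.

```python
def effective_limit(limit, cache_requests, cached_instances, rewrite_requests_cache):
    """Return the limit used while building requests (hypothetical helper)."""
    # Same condition as the quoted snippet: a cache that must be (re)built ignores the limit.
    if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
        return None
    return limit


# No caching: the limit is kept.
print(effective_limit(8, cache_requests=False, cached_instances=None, rewrite_requests_cache=False))   # 8
# Caching enabled, nothing cached yet: the limit is dropped so the cache is complete.
print(effective_limit(8, cache_requests=True, cached_instances=None, rewrite_requests_cache=False))    # None
# Caching enabled and a cache already exists: the limit is kept.
print(effective_limit(8, cache_requests=True, cached_instances=["req"], rewrite_requests_cache=False))  # 8
# Cache rewrite requested: the limit is dropped again.
print(effective_limit(8, cache_requests=True, cached_instances=["req"], rewrite_requests_cache=True))   # None
```

In practice this means a first smoke test with caching enabled can take as long as a full run, because every document is processed once to populate the cache.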
Predict-only mode validation from lmms_eval/__main__.py:573-576:
if args.predict_only:
    args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
    raise ValueError("Specify --output_path if providing --log_samples or --predict_only")