Heuristic:EvolvingLMMs Lab Lmms eval Limit Flag Testing Only

Knowledge Sources	lmms-eval Common user mistake in evaluation reporting
Domains	Evaluation, Best_Practices
Last Updated	2026-02-14 00:00 GMT

Overview

The --limit CLI flag should be used only for testing and debugging, never for computing real benchmark metrics.

Description

The --limit flag restricts the number of evaluation instances processed, allowing quick smoke tests of model + task configurations. However, metrics computed on limited data are statistically unreliable and do not represent true model performance. The framework logs an explicit warning whenever the flag is set. Despite this, users still commonly report metrics obtained with --limit as official benchmark results.

Usage

Use this heuristic as a best practice reminder whenever setting up evaluation runs. Apply --limit only during development/debugging to verify that the pipeline runs end-to-end. Remove it for any evaluation where metrics will be reported or compared.

The Insight (Rule of Thumb)

Action: Use --limit N only for pipeline testing. Remove for real benchmarks.
Value: Common test values: --limit 8 or --limit 16 for quick sanity checks.
Trade-off: Fast iteration during development vs. invalid metrics if accidentally left in production runs.
Related: --predict_only forces log_samples=True and requires --output_path.
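
One way to enforce the rule above in a reporting script is a small guard that refuses to publish numbers from a limited run. This is a minimal sketch, not part of lmms-eval: the helper names `parse_limit` and `assert_reportable` are hypothetical, and parsing the value as a float is an assumption made for illustration.

```python
import argparse
from typing import List, Optional


def parse_limit(argv: List[str]) -> Optional[float]:
    """Extract the value of --limit from a command line, ignoring other flags."""
    parser = argparse.ArgumentParser(add_help=False)
    # Assumption: the limit is numeric; float is used here for illustration.
    parser.add_argument("--limit", type=float, default=None)
    args, _unknown = parser.parse_known_args(argv)
    return args.limit


def assert_reportable(argv: List[str]) -> None:
    """Raise if the run that produced the metrics was started with --limit."""
    limit = parse_limit(argv)
    if limit is not None:
        raise RuntimeError(
            f"Run used --limit {limit}; metrics are for testing only and must not be reported."
        )


# A full run passes silently; a limited run is rejected.
assert_reportable(["--tasks", "mme"])
try:
    assert_reportable(["--tasks", "mme", "--limit", "8"])
except RuntimeError as err:
    print(err)
```

Calling the guard with the saved command line of a run, just before results are written to a report, catches a --limit that was accidentally left in place.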

Reasoning

Benchmarks are designed with specific dataset sizes and distributions. Evaluating on a small subset increases sampling error and, if the subset is not representative of the full dataset, introduces bias. For example, a task with 1000 instances evaluated on only 8 has very wide confidence intervals, so the score can differ substantially from the full-benchmark value and shift sharply when the subset changes. The warning in the code emphasizes that this is a testing-only feature.
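
The sample-size effect can be made concrete with a quick calculation. The sketch below computes the half-width of a 95% normal-approximation confidence interval for an accuracy metric. The function name `accuracy_ci_halfwidth` and the assumed true accuracy of 0.5 are illustrative choices, not anything defined by lmms-eval, and the normal approximation is itself rough at very small n.

```python
import math


def accuracy_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for accuracy p measured on n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Compare typical --limit values against a full 1000-instance task.
for n in (8, 16, 1000):
    hw = accuracy_ci_halfwidth(0.5, n)
    print(f"n={n:>4}: accuracy 0.50 +/- {hw:.3f}")
```

Under these assumptions the interval is roughly +/- 0.35 at n=8 and about +/- 0.03 at n=1000, which is why a score from a limited run says little about full-benchmark performance.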

Additionally, when --limit interacts with request caching, the framework temporarily ignores the limit and processes all documents to build a complete cache, then applies the limit afterward. This prevents incomplete caches.

Code evidence from lmms_eval/__main__.py:590-591:

if args.limit:
    eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING."
                        "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")

Cache interaction from lmms_eval/api/task.py:432-434:

# process all documents when caching is specified for simplicity
if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
    limit = None
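
The behaviour described above (drop the limit while building the cache, then apply it to what is actually evaluated) can be sketched in isolation. This is a simplified stand-in, not the lmms-eval implementation: `build_requests`, `original_limit`, and the string "requests" are hypothetical, and the re-application step follows the prose description rather than the excerpt.

```python
from typing import List, Optional, Tuple


def build_requests(
    docs: List[int],
    limit: Optional[int],
    cache_requests: bool,
    cached_instances: Optional[List[str]] = None,
    rewrite_requests_cache: bool = False,
) -> Tuple[List[str], Optional[List[str]]]:
    """Return (instances to evaluate, cache contents) for a hypothetical task."""
    original_limit = limit  # remember what the user asked for
    building_cache = cache_requests and (not cached_instances or rewrite_requests_cache)

    # Same condition as the excerpt: process all documents when building a cache.
    if building_cache and limit is not None:
        limit = None

    if cached_instances and not building_cache:
        instances = list(cached_instances)  # reuse the complete cache
    else:
        instances = [f"request:{doc}" for doc in docs[:limit]]

    cache = instances if building_cache else cached_instances
    # Re-apply the user's limit only to what is evaluated, never to the cache.
    return instances[:original_limit], cache


evaluated, cache = build_requests(list(range(100)), limit=8, cache_requests=True)
print(len(evaluated), len(cache))
```

In this sketch the cache always holds every request, so a later run without --limit can reuse it, while the current run still evaluates only the first 8 instances.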

Predict-only mode validation from lmms_eval/__main__.py:573-576:

if args.predict_only:
    args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
    raise ValueError("Specify --output_path if providing --log_samples or --predict_only")
