Heuristic:EvolvingLMMs Lab Lmms eval Limit Flag Testing Only

From Leeroopedia
Revision as of 10:54, 16 February 2026 by Admin (talk | contribs) (Auto-imported from heuristics/EvolvingLMMs_Lab_Lmms_eval_Limit_Flag_Testing_Only.md)
Knowledge Sources
Domains Evaluation, Best_Practices
Last Updated 2026-02-14 00:00 GMT

Overview

The --limit CLI flag should be used only for testing and debugging, never for computing real benchmark metrics.

Description

The --limit flag restricts the number of evaluation instances processed, allowing quick smoke tests of a model and task configuration. However, metrics computed on a limited subset are statistically unreliable and do not represent true model performance. The framework logs an explicit warning whenever the flag is set. Even so, users commonly make the mistake of reporting metrics obtained with --limit as official benchmark results.

Usage

Use this heuristic as a best practice reminder whenever setting up evaluation runs. Apply --limit only during development/debugging to verify that the pipeline runs end-to-end. Remove it for any evaluation where metrics will be reported or compared.

The Insight (Rule of Thumb)

  • Action: Use --limit N only for pipeline testing. Remove for real benchmarks.
  • Value: Common test values: --limit 8 or --limit 16 for quick sanity checks.
  • Trade-off: Fast iteration during development vs. invalid metrics if accidentally left in production runs.
  • Related: --predict_only forces log_samples=True; both --log_samples and --predict_only require --output_path.
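
One way to enforce the rule above is a small guard in whatever script collects scores for reporting. The sketch below is a minimal illustration, not part of lmms_eval: the helper names is_reportable and require_full_run are hypothetical, and it assumes the run's settings are available as a mapping with an optional "limit" key.

```python
from typing import Any, Mapping


def is_reportable(run_config: Mapping[str, Any]) -> bool:
    """Return True only if the run was not truncated with a limit.

    Assumption: the run's settings are a mapping with an optional "limit"
    key (hypothetical layout, adapt to how your runs are recorded).
    """
    return run_config.get("limit") is None


def require_full_run(run_config: Mapping[str, Any]) -> None:
    """Raise before metrics from a limited run can be reported."""
    if not is_reportable(run_config):
        raise ValueError(
            f"Run used limit={run_config['limit']}; "
            "rerun without --limit before reporting metrics."
        )


# A full run passes silently; a run recorded with limit=8 would raise.
require_full_run({"limit": None})
```

Failing loudly at reporting time catches the case where --limit was left in a launch script by accident.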

Reasoning

Benchmarks are designed with specific dataset sizes and distributions. Evaluating on a small subset introduces sampling bias and reduces statistical power. For example, a task with 1000 instances evaluated on only 8 would have very wide confidence intervals, and its score could differ wildly depending on which instances happen to be included. The warning in the code emphasizes that this is a testing-only feature.
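
To give a rough sense of scale, the sketch below computes the half-width of a normal-approximation 95% confidence interval for an accuracy of 0.5 at a few sample sizes. The function name is illustrative, and the normal approximation is itself crude at very small n, so treat the numbers as an order-of-magnitude illustration rather than an exact interval.

```python
import math


def accuracy_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for a proportion p on n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Compare typical --limit values with a full 1000-instance task.
for n in (8, 16, 1000):
    half_width = accuracy_ci_half_width(0.5, n)
    print(f"n={n:>4}: accuracy 0.50 +/- {half_width:.3f}")
```

Under these assumptions the interval is roughly +/- 0.35 at 8 samples and +/- 0.03 at 1000, which is why a score from a limited run says little about the full benchmark.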

Additionally, when --limit is combined with request caching and no cached instances exist yet (or a cache rewrite is requested), the framework temporarily ignores the limit and processes all documents to build a complete cache, then applies the limit afterward. This prevents incomplete caches.

Code evidence from lmms_eval/__main__.py:590-591:

if args.limit:
    eval_logger.warning(" --limit SHOULD ONLY BE USED FOR TESTING."
                        "REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")

Cache interaction from lmms_eval/api/task.py:432-434:

# process all documents when caching is specified for simplicity
if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
    limit = None
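
The condition above can be isolated into a standalone function to see which combinations drop the limit. This is an illustrative sketch only: resolve_limit is a hypothetical name, not a function in lmms_eval, and it mirrors just the quoted condition, not the surrounding method.

```python
from typing import Optional, Sequence


def resolve_limit(
    limit: Optional[int],
    cache_requests: bool,
    cached_instances: Optional[Sequence[object]],
    rewrite_requests_cache: bool,
) -> Optional[int]:
    """Return the limit to use while building requests.

    Mirrors the quoted condition: when caching is on and the cache must be
    (re)built, the limit is dropped so that every document is processed.
    """
    if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
        return None
    return limit


# Cache requested but empty: the limit is ignored while the cache is built.
print(resolve_limit(8, cache_requests=True, cached_instances=None, rewrite_requests_cache=False))
# Cache already populated and no rewrite requested: the limit is kept.
print(resolve_limit(8, cache_requests=True, cached_instances=["req"], rewrite_requests_cache=False))
```

A practical consequence is that the first limited run with caching enabled can take as long as a full run, because every document is processed to populate the cache.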

Predict-only mode validation from lmms_eval/__main__.py:573-576:

if args.predict_only:
    args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
    raise ValueError("Specify --output_path if providing --log_samples or --predict_only")
