Implementation:EvolvingLMMs Lab Lmms eval Cli Evaluate Model Test

From Leeroopedia
Knowledge Sources
Domains: Testing, Model_Management
Last Updated: 2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the lmms-eval framework, for validating a custom model integration through limited evaluation runs.

Description

The cli_evaluate function in lmms_eval/__main__.py is the main entry point for running evaluations from the command line. Combined with the --limit flag, it provides a fast validation mechanism for newly integrated models. The function orchestrates argument parsing, model instantiation, task building, evaluation dispatch, and result reporting.

Internally, cli_evaluate delegates to evaluator.simple_evaluate(), which handles model resolution via the registry, task dictionary construction, request building, and the core evaluation loop. The limit parameter flows through to evaluate(), where it caps the number of documents processed per task.
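The capping behavior described above can be sketched as follows. This is an illustrative helper, not the framework's code: `resolve_limit` is a hypothetical name, and rounding a fractional limit up is an assumption made for this sketch. Only the distinction between an absolute count and a value below 1 comes from the I/O contract on this page.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], total_docs: int) -> int:
    """Return how many documents a task would process under a given limit.

    Hypothetical helper for illustration; not part of lmms-eval.
    """
    if limit is None:
        # No limit: evaluate every document in the task.
        return total_docs
    if limit < 1:
        # Fractional limit: interpret it as a share of the task's documents.
        # Rounding up is an assumption made for this sketch.
        return math.ceil(total_docs * limit)
    # Absolute limit: never exceed the number of available documents.
    return min(int(limit), total_docs)


if __name__ == "__main__":
    print(resolve_limit(8, 100))     # absolute cap
    print(resolve_limit(0.5, 100))   # share of the documents
    print(resolve_limit(None, 100))  # no cap
```

A helper like this can be handy when estimating how long a limited run should take before launching it.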

The --log_samples flag enables per-sample output saving: for each evaluation example, the run writes the document, the request arguments, the raw responses, the filtered responses, and the computed metrics to per-task JSONL files.

Usage

Use this CLI command pattern immediately after completing a model integration to verify that the model loads, processes inputs, generates outputs, and produces metrics without errors. Start with --limit 8 and increase gradually as confidence grows.
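The "start small, then increase" advice can be scripted. The sketch below only assembles command lines from flags documented on this page. The function names and the 8, 32, 128 schedule are hypothetical choices for illustration, not something the framework prescribes.

```python
import sys
from typing import Iterator, List, Sequence


def build_command(model: str, model_args: str, task: str, limit: int) -> List[str]:
    """Assemble the argument list for one limited evaluation run."""
    return [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", task,
        "--limit", str(limit),
        "--batch_size", "1",
    ]


def escalating_commands(
    model: str,
    model_args: str,
    task: str,
    limits: Sequence[int] = (8, 32, 128),  # hypothetical schedule
) -> Iterator[List[str]]:
    """Yield one command per limit, smallest limit first."""
    for limit in sorted(limits):
        yield build_command(model, model_args, task, limit)


if __name__ == "__main__":
    for cmd in escalating_commands(
        "my_custom_model", "pretrained=my-org/my-model", "mme"
    ):
        # Print only; nothing is executed by this sketch.
        print(" ".join(cmd))
```

Each yielded list could be passed to `subprocess.run(cmd, check=True)`, so that a failure at a small limit stops the sequence before the more expensive runs begin.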

Code Reference

Source Location

  • Repository: lmms-eval
  • File (CLI entry): lmms_eval/__main__.py, lines 445-544
  • File (evaluator): lmms_eval/evaluator.py, lines 191-204

Signature

import argparse
from typing import List, Optional, Union

# CLI entry point
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...

# Core evaluation function
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict]] = None,
    tasks: Optional[List[Union[str, dict, object]]] = None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
    max_batch_size: Optional[int] = None,
    device: Optional[str] = None,
    limit: Optional[Union[int, float]] = None,
    offset: int = 0,
    log_samples: bool = True,
    force_simple: bool = False,
    # ... additional parameters
): ...

Import

# Typically invoked from the command line, not imported directly:
# python -m lmms_eval --model <name> --model_args <args> --tasks <task> --limit 8

# For programmatic use:
from lmms_eval.evaluator import simple_evaluate

I/O Contract

Inputs

Name | Type | Required | Description
--model | str | Yes | Registered model name to test (e.g., my_custom_model).
--model_args | str | Yes | Comma-separated constructor arguments: pretrained=<path>,max_pixels=<N>.
--tasks | str | Yes | Comma-separated task names for testing (e.g., mme).
--limit | int/float | No (but recommended for testing) | Number of examples per task. Use 8 for quick smoke tests. A value below 1 is treated as a fraction of each task's examples.
--log_samples | flag | No | When present, saves all model outputs and documents for per-sample inspection.
--force_simple | flag | No | When present, forces the simple protocol even if a chat implementation exists.
--output_path | str | Required if --log_samples | Directory path for saving result files and per-sample logs.
--device | str | No | Target device (e.g., cuda:0). If omitted, determined by the Accelerator.
--batch_size | int/str | No (default 1) | Batch size for model inference. Use 1 when testing to minimize memory issues.
--verbosity | str | No (default "INFO") | Set to "DEBUG" for detailed error tracebacks during testing.
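To make the --model_args format concrete, here is a minimal sketch of how a comma-separated key=value string maps to constructor keyword arguments. `parse_model_args` is a hypothetical name; the framework's own parser is not shown on this page and may behave differently, for example by converting value types.

```python
from typing import Dict


def parse_model_args(arg_string: str) -> Dict[str, str]:
    """Split "key=value,key=value" into a dict of strings.

    Illustrative only: every value is kept as a string, which may not
    match what the framework's own parser does.
    """
    parsed: Dict[str, str] = {}
    for item in arg_string.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and trailing commas
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = value
    return parsed


if __name__ == "__main__":
    print(parse_model_args("pretrained=my-org/my-model,max_pixels=12845056"))
```

Running a check like this on the argument string before launching an evaluation can surface a malformed pair without waiting for the model to load.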

Outputs

Name | Type | Description
Console table | text | Formatted results table showing per-task metrics (accuracy, score, etc.).
Results JSON | dict | Aggregated results including model config, task configs, versions, metrics, and sample counts.
Sample logs | JSON files | Per-task JSONL files containing doc_id, doc, target, arguments, resps, filtered_resps, and computed metrics for each example.
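After a run with --log_samples, the sample logs can be checked mechanically before reading them by hand. The sketch below assumes the logs are files with a .jsonl extension somewhere under the --output_path directory. The exact file names and directory layout are not specified on this page, so the recursive search is an assumption, and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List

# Field names taken from the Outputs table on this page.
EXPECTED_FIELDS = ("doc_id", "target", "resps", "filtered_resps")


def iter_samples(path: Path) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def missing_fields(record: Dict) -> List[str]:
    """Return the expected fields that a sample record lacks."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def audit_samples(output_dir: str) -> Dict[str, int]:
    """Count incomplete records per JSONL file found under output_dir."""
    report: Dict[str, int] = {}
    # Assumption: sample logs use the .jsonl extension.
    for path in sorted(Path(output_dir).rglob("*.jsonl")):
        report[path.name] = sum(
            1 for record in iter_samples(path) if missing_fields(record)
        )
    return report


if __name__ == "__main__":
    print(audit_samples("./test_results/"))
```

A report with a zero count for every file suggests the integration produced complete records. A non-zero count points at the file worth opening first.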

Usage Examples

Quick Smoke Test

# Minimal test: 8 examples, single task, single GPU
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8 \
    --device cuda:0

Test with Sample Logging

# Save per-sample outputs for manual inspection
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model,max_pixels=12845056 \
    --tasks mme \
    --limit 8 \
    --log_samples \
    --output_path ./test_results/ \
    --device cuda:0

Test Both Protocols

# Test chat protocol (default if chat_class_path is registered)
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8

# Test simple protocol explicitly
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8 \
    --force_simple

Debug Mode with Verbose Output

# Enable debug verbosity for detailed error tracebacks
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8 \
    --verbosity DEBUG \
    --device cuda:0

Programmatic Testing

from lmms_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="my_custom_model",
    model_args="pretrained=my-org/my-model",
    tasks=["mme"],
    limit=8,
    batch_size=1,
    device="cuda:0",
    log_samples=True,
)

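# Check results. Guard against None first: simple_evaluate may return None
# (for example, on non-primary ranks in a distributed run).
if results is not None:
    # results["results"] maps each task name to its metric dictionary
    for task_name, task_results in results["results"].items():
        print(f"{task_name}: {task_results}")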
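Printing the metrics is a start, but a smoke test is more useful when it fails loudly. The sketch below inspects a results dictionary shaped like the one iterated above (a "results" key mapping task names to metric dictionaries). `find_problems` is a hypothetical helper, and the metric keys used in its example are made up for illustration.

```python
from numbers import Number
from typing import Dict, List, Optional


def find_problems(results: Optional[Dict]) -> List[str]:
    """Return human-readable problems found in a results dictionary."""
    problems: List[str] = []
    task_results = (results or {}).get("results", {})
    if not task_results:
        problems.append("no task results present")
    for task_name, metrics in task_results.items():
        # Keep only real numbers; bool is excluded because it subclasses int.
        numeric = {
            key: value
            for key, value in metrics.items()
            if isinstance(value, Number) and not isinstance(value, bool)
        }
        if not numeric:
            problems.append(f"{task_name}: no numeric metrics")
        for key, value in numeric.items():
            if value != value:  # NaN is the only value not equal to itself
                problems.append(f"{task_name}: {key} is NaN")
    return problems


if __name__ == "__main__":
    # Made-up example data; real metric names depend on the task.
    example = {"results": {"mme": {"score": 12.5, "alias": "mme"}}}
    print(find_problems(example))
```

In a test script, `assert not find_problems(results)` turns a silent empty or NaN-filled run into an explicit failure.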

Related Pages

Implements Principle
