Implementation:EvolvingLMMs Lab Lmms eval Cli Evaluate Model Test

Knowledge Sources	lmms-eval
Domains	Testing, Model_Management
Last Updated	2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the lmms-eval framework, for validating a custom model integration through short, limited evaluation runs.

Description

The cli_evaluate function in lmms_eval/__main__.py is the main entry point for running evaluations from the command line. Combined with the --limit flag, it provides a fast validation mechanism for newly integrated models. The function orchestrates argument parsing, model instantiation, task building, evaluation dispatch, and result reporting.

Internally, cli_evaluate delegates to evaluator.simple_evaluate(), which handles model resolution via the registry, task dictionary construction, request building, and the core evaluation loop. The limit parameter flows through to evaluate(), where it caps the number of documents processed per task.
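The way a limit value might translate into a per-task document cap can be sketched as follows. This is an illustration only: `resolve_limit` is a hypothetical helper, not a function from lmms-eval, and rounding fractional limits up is an assumption made here for the sake of the example.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], num_docs: int) -> int:
    """Hypothetical helper: turn a --limit value into a document count."""
    if limit is None:
        # No limit given: every document in the task is evaluated.
        return num_docs
    if limit < 1:
        # Values below 1 are read as a share of the task's documents.
        # Rounding up is an assumption of this sketch.
        return min(num_docs, math.ceil(num_docs * limit))
    # Integer-like limits are an absolute cap, never above the dataset size.
    return min(num_docs, int(limit))


print(resolve_limit(8, 1000))     # absolute cap
print(resolve_limit(0.25, 1000))  # fractional cap
print(resolve_limit(None, 1000))  # no cap
```

The same cap applies independently to each task, so a run over several tasks processes at most that many documents per task rather than in total.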

The --log_samples flag turns on per-sample output saving. For each evaluation example, the run writes a record containing the document, model arguments, raw responses, filtered responses, and computed metrics to per-task JSONL files under --output_path.

Usage

Use this CLI command pattern immediately after completing a model integration to verify that the model loads, processes inputs, generates outputs, and produces metrics without errors. Start with --limit 8 and increase gradually as confidence grows.
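A gradual ramp-up can be scripted by generating one command per limit. The sketch below only builds and prints the command lines; it uses the flags documented on this page, while the helper name `smoke_test_command` and the 8, 32, 128 schedule are assumptions chosen for illustration.

```python
import shlex


def smoke_test_command(model, model_args, task, limit):
    """Build the argv list for one limited evaluation run."""
    return [
        "python", "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", task,
        "--limit", str(limit),
    ]


# Hypothetical schedule: start small, widen once the previous run is clean.
commands = [
    smoke_test_command("my_custom_model", "pretrained=my-org/my-model", "mme", n)
    for n in (8, 32, 128)
]

for cmd in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

To execute a command rather than print it, the argv list could be passed to `subprocess.run(cmd, check=True)`, stopping the ramp-up at the first failing run.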

Code Reference

Source Location

Repository: lmms-eval
File (CLI entry): lmms_eval/__main__.py, Lines L445-544
File (evaluator): lmms_eval/evaluator.py, Lines L191-204

Signature

# CLI entry point
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...

# Core evaluation function
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict]] = None,
    tasks: Optional[List[Union[str, dict, object]]] = None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
    max_batch_size: Optional[int] = None,
    device: Optional[str] = None,
    limit: Optional[Union[int, float]] = None,
    offset: int = 0,
    log_samples: bool = True,
    force_simple: bool = False,
    # ... additional parameters
): ...

Import

# Typically invoked from the command line, not imported directly:
# python -m lmms_eval --model <name> --model_args <args> --tasks <task> --limit 8

# For programmatic use:
from lmms_eval.evaluator import simple_evaluate

I/O Contract

Inputs

Name	Type	Required	Description
--model	`str`	Yes	Registered model name to test (e.g., `my_custom_model`).
--model_args	`str`	Yes	Comma-separated constructor arguments: `pretrained=<path>,max_pixels=<N>`.
--tasks	`str`	Yes	Comma-separated task names for testing (e.g., `mme`).
--limit	`int/float`	No (but recommended for testing)	Maximum number of examples per task. Use `8` for quick smoke tests. A value below `1` is treated as a fraction of each task's examples rather than an absolute count.
--log_samples	flag	No	When present, saves all model outputs and documents for per-sample inspection.
--force_simple	flag	No	When present, forces the simple protocol even if a chat implementation exists.
--output_path	`str`	Required if `--log_samples`	Directory path for saving result files and per-sample logs.
--device	`str`	No	Target device (e.g., `cuda:0`). If omitted, determined by the Accelerator.
--batch_size	`int/str`	No (default 1)	Batch size for model inference. Use `1` when testing to minimize memory issues.
--verbosity	`str`	No (default `"INFO"`)	Set to `"DEBUG"` for detailed error tracebacks during testing.
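To make the `--model_args` format concrete, the sketch below splits a comma-separated `key=value` string into a dictionary of constructor arguments. `parse_model_args` is an illustrative stand-in, not the framework's own parser, and keeping every value as a string is an assumption; the real parser may coerce types.

```python
def parse_model_args(arg_string):
    """Illustrative parser: 'a=1,b=2' -> {'a': '1', 'b': '2'}."""
    parsed = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma or an empty string
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value  # values stay strings in this sketch
    return parsed


print(parse_model_args("pretrained=my-org/my-model,max_pixels=12845056"))
```

A malformed entry such as `pretrained` without `=` raises `ValueError` here, which mirrors the kind of mistake worth ruling out before blaming the model class itself.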

Outputs

Name	Type	Description
Console table	text	Formatted results table showing per-task metrics (accuracy, score, etc.).
Results JSON	`dict`	Aggregated results including model config, task configs, versions, metrics, and sample counts.
Sample logs	JSONL files	Per-task JSONL files with one record per example, each containing `doc_id`, `doc`, `target`, `arguments`, `resps`, `filtered_resps`, and the computed metrics.
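Sample logs are most useful when scanned for obviously broken outputs. The sketch below reads a JSONL file and lists the `doc_id` of every record whose filtered responses are empty. It runs against stand-in data written to a temporary directory; the file name, the record contents, and both helper names are hypothetical, and only the field names come from the table above.

```python
import json
import tempfile
from pathlib import Path


def load_samples(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def empty_response_ids(samples):
    """Return doc_ids whose filtered responses are all empty or whitespace."""
    bad = []
    for sample in samples:
        resps = sample.get("filtered_resps") or []
        if not any(str(r).strip() for r in resps):
            bad.append(sample["doc_id"])
    return bad


# Stand-in records; a real log holds more fields per example.
demo = [
    {"doc_id": 0, "target": "Yes", "resps": [["Yes"]], "filtered_resps": ["Yes"]},
    {"doc_id": 1, "target": "No", "resps": [[""]], "filtered_resps": [""]},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "samples_demo.jsonl"  # hypothetical file name
    sample_file.write_text(
        "\n".join(json.dumps(record) for record in demo) + "\n",
        encoding="utf-8",
    )
    samples = load_samples(sample_file)

print(empty_response_ids(samples))
```

Empty filtered responses after a smoke test often point at a generation or post-processing problem in the new model class, so they are a reasonable first thing to look for.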

Usage Examples

Quick Smoke Test

# Minimal test: 8 examples, single task, single GPU
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8 \
    --device cuda:0

Test with Sample Logging

# Save per-sample outputs for manual inspection
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model,max_pixels=12845056 \
    --tasks mme \
    --limit 8 \
    --log_samples \
    --output_path ./test_results/ \
    --device cuda:0

Test Both Protocols

# Test chat protocol (default if chat_class_path is registered)
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8

# Test simple protocol explicitly
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8 \
    --force_simple

Debug Mode with Verbose Output

# Enable debug verbosity for detailed error tracebacks
python -m lmms_eval \
    --model my_custom_model \
    --model_args pretrained=my-org/my-model \
    --tasks mme \
    --limit 8 \
    --verbosity DEBUG \
    --device cuda:0

Programmatic Testing

from lmms_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="my_custom_model",
    model_args="pretrained=my-org/my-model",
    tasks=["mme"],
    limit=8,
    batch_size=1,
    device="cuda:0",
    log_samples=True,
)

# Check results
if results is not None:
    for task_name, task_results in results["results"].items():
        print(f"{task_name}: {task_results}")
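The check above can be turned into a small pass/fail helper for automated smoke tests. This is a sketch under stated assumptions: `check_smoke_results` is a hypothetical helper, the only structure it relies on is the `results["results"]` mapping used in the example above, and the metric name in the mock data is invented.

```python
def check_smoke_results(results, expected_tasks):
    """Return a list of problems; an empty list means the run looks sane."""
    if results is None:
        # The example above guards against None, so this sketch does too.
        return ["no results returned"]
    problems = []
    per_task = results.get("results", {})
    for task in expected_tasks:
        metrics = per_task.get(task)
        if not metrics:
            # Missing task entry or an empty metrics dict.
            problems.append(f"{task}: no metrics reported")
    return problems


# Mock payload standing in for a real simple_evaluate return value.
mock_results = {"results": {"mme": {"hypothetical_score": 0.5}, "other_task": {}}}

print(check_smoke_results(mock_results, ["mme"]))
print(check_smoke_results(mock_results, ["mme", "other_task"]))
print(check_smoke_results(None, ["mme"]))
```

With only 8 examples the metric values themselves carry little meaning, so the helper checks that metrics exist rather than what they are.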

Related Pages

Implements Principle
