Implementation:EvolvingLMMs Lab Lmms eval Cli Evaluate Model Test
| Knowledge Sources | |
|---|---|
| Domains | Testing, Model_Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool, provided by the lmms-eval framework, for validating a custom model integration with short, limited evaluation runs.
Description
The cli_evaluate function in lmms_eval/__main__.py is the main entry point for running evaluations from the command line. Combined with the --limit flag, it provides a fast validation mechanism for newly integrated models. The function orchestrates argument parsing, model instantiation, task building, evaluation dispatch, and result reporting.
Internally, cli_evaluate delegates to evaluator.simple_evaluate(), which handles model resolution via the registry, task dictionary construction, request building, and the core evaluation loop. The limit parameter flows through to evaluate(), where it caps the number of documents processed per task.
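The capping behaviour described above can be sketched as follows. This is an illustration rather than the library's code: `resolve_limit` is a hypothetical helper, and rounding fractional limits up is an assumption made for this sketch, based only on the `--limit` description (values below 1 are treated as a share of the task's documents).

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], num_docs: int) -> int:
    """Return how many documents a task would process under `limit`.

    Hypothetical helper for illustration; not part of lmms-eval.
    """
    if limit is None:
        return num_docs  # no cap: every document is evaluated
    if limit < 1:
        # Fractional limit: interpret it as a share of the dataset.
        # Rounding up is an assumption made for this sketch.
        return min(num_docs, math.ceil(num_docs * limit))
    # Absolute cap, never more than the documents available.
    return min(num_docs, int(limit))


print(resolve_limit(8, 125))     # 8
print(resolve_limit(0.1, 125))   # 13
print(resolve_limit(None, 125))  # 125
```

The practical point for smoke tests is that an integer limit such as 8 keeps the run size constant across tasks, while a fractional limit scales with each task's dataset size.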
The --log_samples flag triggers per-sample output saving, which writes detailed JSON files containing the document, model arguments, raw responses, filtered responses, and computed metrics for each evaluation example.
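A saved sample log can be checked with a short script along these lines. This is a minimal sketch that assumes each log stores one JSON object per line (JSONL) and uses the field names listed in the Outputs table; `load_samples` and `missing_fields` are hypothetical helpers, not lmms-eval functions.

```python
import json
from pathlib import Path
from typing import Dict, List

# Field names taken from the sample-log description on this page.
EXPECTED_FIELDS = ("doc_id", "doc", "target", "arguments", "resps", "filtered_resps")


def load_samples(path: Path) -> List[Dict]:
    """Read one sample log, assuming one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def missing_fields(sample: Dict) -> List[str]:
    """List the expected fields that a sample record lacks."""
    return [name for name in EXPECTED_FIELDS if name not in sample]
```

After a run with --log_samples, point `load_samples` at a log file under the chosen --output_path (the exact file name depends on the run) and confirm that `missing_fields` returns an empty list for every record.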
Usage
Use this CLI command pattern immediately after completing a model integration to verify that the model loads, processes inputs, generates outputs, and produces metrics without errors. Start with --limit 8 and increase gradually as confidence grows.
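The "start small, then increase" advice can be automated with a small driver. The sketch below uses only the flags documented on this page; `build_command` and `run_escalating` are hypothetical helpers, and the limit schedule (8, 32, 128) is an arbitrary choice for illustration.

```python
import subprocess
import sys
from typing import List, Sequence


def build_command(model: str, model_args: str, task: str, limit: int) -> List[str]:
    """Assemble the argument list for one limited evaluation run."""
    return [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", task,
        "--limit", str(limit),
    ]


def run_escalating(model: str, model_args: str, task: str,
                   limits: Sequence[int] = (8, 32, 128)) -> int:
    """Run with each limit in turn and stop at the first failing run.

    Returns the largest limit that completed with exit code 0, or 0 if
    even the first run failed.
    """
    passed = 0
    for limit in limits:
        completed = subprocess.run(build_command(model, model_args, task, limit))
        if completed.returncode != 0:
            break
        passed = limit
    return passed
```

For example, `run_escalating("my_custom_model", "pretrained=my-org/my-model", "mme")` would launch up to three runs and report how far the model got before a failure.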
Code Reference
Source Location
- Repository: lmms-eval
- File (CLI entry): lmms_eval/__main__.py, Lines L445-544
- File (evaluator): lmms_eval/evaluator.py, Lines L191-204
Signature
# CLI entry point
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...
# Core evaluation function
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict]] = None,
    tasks: Optional[List[Union[str, dict, object]]] = None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
    max_batch_size: Optional[int] = None,
    device: Optional[str] = None,
    limit: Optional[Union[int, float]] = None,
    offset: int = 0,
    log_samples: bool = True,
    force_simple: bool = False,
    # ... additional parameters
): ...
Import
# Typically invoked from the command line, not imported directly:
# python -m lmms_eval --model <name> --model_args <args> --tasks <task> --limit 8
# For programmatic use:
from lmms_eval.evaluator import simple_evaluate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | str | Yes | Registered model name to test (e.g., my_custom_model). |
| --model_args | str | Yes | Comma-separated constructor arguments: pretrained=<path>,max_pixels=<N>. |
| --tasks | str | Yes | Comma-separated task names for testing (e.g., mme). |
| --limit | int/float | No (but recommended for testing) | Number of examples per task. Use 8 for quick smoke tests. If <1, treated as a percentage. |
| --log_samples | flag | No | When present, saves all model outputs and documents for per-sample inspection. |
| --force_simple | flag | No | When present, forces the simple protocol even if a chat implementation exists. |
| --output_path | str | Required if --log_samples | Directory path for saving result files and per-sample logs. |
| --device | str | No | Target device (e.g., cuda:0). If omitted, determined by the Accelerator. |
| --batch_size | int/str | No (default 1) | Batch size for model inference. Use 1 when testing to minimize memory issues. |
| --verbosity | str | No (default "INFO") | Set to "DEBUG" for detailed error tracebacks during testing. |
Outputs
| Name | Type | Description |
|---|---|---|
| Console table | text | Formatted results table showing per-task metrics (accuracy, score, etc.). |
| Results JSON | dict | Aggregated results including model config, task configs, versions, metrics, and sample counts. |
| Sample logs | JSON files | Per-task JSONL files containing doc_id, doc, target, arguments, resps, filtered_resps, and computed metrics for each example. |
Usage Examples
Quick Smoke Test
# Minimal test: 8 examples, single task, single GPU
python -m lmms_eval \
--model my_custom_model \
--model_args pretrained=my-org/my-model \
--tasks mme \
--limit 8 \
--device cuda:0
Test with Sample Logging
# Save per-sample outputs for manual inspection
python -m lmms_eval \
--model my_custom_model \
--model_args pretrained=my-org/my-model,max_pixels=12845056 \
--tasks mme \
--limit 8 \
--log_samples \
--output_path ./test_results/ \
--device cuda:0
Test Both Protocols
# Test chat protocol (default if chat_class_path is registered)
python -m lmms_eval \
--model my_custom_model \
--model_args pretrained=my-org/my-model \
--tasks mme \
--limit 8
# Test simple protocol explicitly
python -m lmms_eval \
--model my_custom_model \
--model_args pretrained=my-org/my-model \
--tasks mme \
--limit 8 \
--force_simple
Debug Mode with Verbose Output
# Enable debug verbosity for detailed error tracebacks
python -m lmms_eval \
--model my_custom_model \
--model_args pretrained=my-org/my-model \
--tasks mme \
--limit 8 \
--verbosity DEBUG \
--device cuda:0
Programmatic Testing
from lmms_eval.evaluator import simple_evaluate
results = simple_evaluate(
    model="my_custom_model",
    model_args="pretrained=my-org/my-model",
    tasks=["mme"],
    limit=8,
    batch_size=1,
    device="cuda:0",
    log_samples=True,
)
# Check results
if results is not None:
    for task_name, task_results in results["results"].items():
        print(f"{task_name}: {task_results}")
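Beyond printing, a smoke test can assert that every task produced at least one numeric metric. The sketch below assumes the layout used in the loop above, where results["results"] maps each task name to a dictionary of metric names and values; `numeric_metrics` is a hypothetical helper, and the metric names in the usage note are placeholders.

```python
from numbers import Real
from typing import Dict, Optional


def numeric_metrics(results: Optional[dict]) -> Dict[str, Dict[str, float]]:
    """Collect the numeric metric values reported for each task.

    Returns an empty dict when `results` is None, mirroring the guard in
    the example above. Non-numeric entries (labels, placeholders) and
    booleans are skipped.
    """
    if results is None:
        return {}
    summary: Dict[str, Dict[str, float]] = {}
    for task_name, task_results in results.get("results", {}).items():
        summary[task_name] = {
            name: float(value)
            for name, value in task_results.items()
            if isinstance(value, Real) and not isinstance(value, bool)
        }
    return summary
```

A task whose entry comes back empty, for example `{"mme": {}}`, ran without producing a usable score and is worth inspecting in the per-sample logs.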