Implementation:EvolvingLMMs Lab Lmms eval Cli Evaluate Task Test
| Knowledge Sources | |
|---|---|
| Domains | Testing, Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool, provided by the lmms-eval framework, for validating custom tasks by running limited evaluations with per-sample logging.
Description
The cli_evaluate() function is the main entry point for running evaluations from the command line. It parses arguments, initializes the accelerator for distributed execution, and delegates to simple_evaluate(), which orchestrates the full evaluation pipeline: task loading, model instantiation, request building, inference, and metric computation.
For task testing purposes, three key CLI flags enable rapid validation:
--limit N: Restricts evaluation to the first N examples from the dataset. When N is an integer, exactly N samples are evaluated. When N is a float less than 1, it is interpreted as a fraction of the total dataset (for example, 0.1 selects about 10% of the examples). This enables quick smoke tests that exercise the full pipeline without waiting for a complete benchmark run.
--log_samples: Enables per-sample output logging. When active, the framework saves a JSON file containing each document's prompt, the model's raw output, the ground truth target, and the computed per-sample metric values. This file is invaluable for debugging prompt construction and scoring logic.
--predict_only: Runs model inference and logs outputs without computing metrics. This is useful for iterating on prompt design and generation parameters before implementing or debugging metric functions.
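To make the two readings of --limit concrete, here is a minimal sketch of how an integer or fractional limit could map to a sample count. The helper name resolve_limit is hypothetical and not part of lmms-eval, and rounding a fractional limit up is an assumption; check the framework source for the exact rule.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], dataset_size: int) -> int:
    """Return how many examples a given --limit value would select.

    Hypothetical helper for illustration only; not part of lmms-eval.
    """
    if limit is None:
        return dataset_size  # no limit: evaluate the whole dataset
    if limit < 1:
        # Fractional limit: a share of the dataset (rounding up is an assumption)
        return min(dataset_size, math.ceil(dataset_size * limit))
    # Absolute count, capped at the dataset size
    return min(dataset_size, int(limit))


print(resolve_limit(8, 1000))     # 8
print(resolve_limit(0.25, 1000))  # 250
print(resolve_limit(None, 1000))  # 1000
```

With a small dataset the cap matters: a limit of 8 on a 5-example dataset selects all 5 examples.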
The simple_evaluate() function handles the orchestration:
- Initializes the task manager and loads requested tasks
- Instantiates the model from the specified model name and arguments
- Builds evaluation requests for each task
- Runs model inference with batching
- Applies filters and computes metrics (unless predict_only)
- Returns results dict and optional sample logs
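The steps above can be pictured with a toy stand-in. This is a sketch under stated assumptions, not the lmms-eval implementation: toy_evaluate, its generate and score callables, and the "accuracy" metric name are all hypothetical. It only shows where limit and predict_only take effect in the flow.

```python
from typing import Callable, Dict, List, Optional


def toy_evaluate(
    docs: List[dict],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    limit: Optional[int] = None,
    predict_only: bool = False,
) -> Dict[str, object]:
    """Toy evaluation loop; a hypothetical illustration, not lmms-eval code."""
    selected = docs[:limit] if limit is not None else docs  # apply the limit
    samples = []
    for doc in selected:
        output = generate(doc["question"])  # stand-in for model inference
        sample = {"doc": doc, "resps": output, "target": doc["answer"]}
        if not predict_only:
            # Per-sample metric, skipped entirely in predict-only mode
            sample["score"] = score(output, doc["answer"])
        samples.append(sample)
    results = {}
    if not predict_only and samples:
        # Aggregate the per-sample scores into one task-level number
        results["accuracy"] = sum(s["score"] for s in samples) / len(samples)
    return {"results": results, "samples": samples}


docs = [{"question": "1+1?", "answer": "2"}, {"question": "2+2?", "answer": "4"}]
out = toy_evaluate(docs, generate=lambda q: "2", score=lambda o, t: float(o == t))
print(out["results"])  # {'accuracy': 0.5}
```

Calling the same function with predict_only=True still produces the samples but leaves the results dictionary empty, which mirrors the intent of the --predict_only flag.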
After evaluation, cli_evaluate() prints a formatted results table using make_table() and optionally logs to Weights & Biases.
Usage
Use this to validate your custom task after creating the YAML configuration and utility functions. Run with --limit 8 --log_samples for initial testing, then without --limit for full evaluation. Use --predict_only when debugging prompts independently of metrics.
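One way to keep the smoke-test and full-run invocations in sync is to assemble the command from a single helper. The sketch below uses only flags documented on this page; build_eval_command is a hypothetical name, and the fixed batch size of 1 is an illustrative choice.

```python
import shlex
import sys
from typing import List, Optional


def build_eval_command(
    model: str,
    model_args: str,
    task: str,
    output_path: str,
    limit: Optional[int] = None,
    log_samples: bool = True,
    predict_only: bool = False,
) -> List[str]:
    """Assemble an argv list for `python -m lmms_eval` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", task,
        "--batch_size", "1",
        "--output_path", output_path,
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]  # omit for a full evaluation
    if log_samples:
        cmd.append("--log_samples")
    if predict_only:
        cmd.append("--predict_only")
    return cmd


smoke = build_eval_command(
    "llava", "pretrained=liuhaotian/llava-v1.5-7b",
    "my_custom_task", "./test_results", limit=8,
)
full = build_eval_command(
    "llava", "pretrained=liuhaotian/llava-v1.5-7b",
    "my_custom_task", "./results",
)
# Print the commands without the interpreter path for readability
print(shlex.join(smoke[1:]))
print(shlex.join(full[1:]))
```

Either list can be passed to subprocess.run(cmd, check=True) to launch the evaluation.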
Code Reference
Source Location
- Repository: lmms-eval
- File: lmms_eval/__main__.py (lines 445-544 for cli_evaluate), lmms_eval/evaluator.py (lines 51-150 for simple_evaluate signature)
Signature
# CLI entry point
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

# Core evaluation function
@positional_deprecated
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict]] = None,
    tasks: Optional[List[Union[str, dict, object]]] = None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
    device: Optional[str] = None,
    limit: Optional[Union[int, float]] = None,
    offset: int = 0,
    log_samples: bool = True,
    predict_only: bool = False,
    gen_kwargs: Optional[str] = None,
    task_manager: Optional[TaskManager] = None,
    verbosity: str = "INFO",
    # ... additional parameters
) -> Tuple[dict, Optional[dict]]:
Import
from lmms_eval.__main__ import cli_evaluate
from lmms_eval.evaluator import simple_evaluate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | str | Yes | Name of the model to evaluate (e.g., "llava", "qwen2_5_vl", "gpt4v"). |
| --model_args | str | Yes | Comma-separated key=value arguments for model construction (e.g., "pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056"). |
| --tasks | str | Yes | Comma-separated list of task names to evaluate (e.g., "mme", "mmmu,mme"). |
| --limit | Optional[Union[int, float]] | No | Number of examples to evaluate per task, or a fraction of the dataset when the value is less than 1. Use for testing (e.g., 8). |
| --log_samples | bool | No | If set, enables per-sample output logging to JSON files. Off unless the flag is passed; the log_samples parameter of simple_evaluate() defaults to True. |
| --predict_only | bool | No | If set, generates model outputs without computing metrics. |
| --batch_size | Union[int, str] | No | Batch size for model inference. Use "auto" for automatic detection. |
| --device | Optional[str] | No | PyTorch device for model execution (e.g., "cuda:0"). |
| --output_path | Optional[str] | No | Directory to save results and sample logs. |
| --include_path | Optional[str] | No | Additional directory to search for task definitions. |
| --verbosity | str | No | Logging verbosity level (default: "INFO"). Set to "DEBUG" for detailed error traces. |
Outputs
| Name | Type | Description |
|---|---|---|
| Results table | stdout | A formatted table printed to stdout showing metric scores for each task and group. |
| Results JSON | file | A JSON file (when --output_path is specified) containing the full results dictionary with task configs, metric scores, and metadata. |
| Sample logs | file | JSON files (when --log_samples is active) containing per-sample prompts, model outputs, targets, and metric values. |
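After a run, the sample logs can be read back for inspection. The sketch below is a generic loader, not an lmms-eval API: load_sample_logs is a hypothetical name, and it assumes that sample log files carry "samples" in their file name and are either JSON or JSON Lines. Adjust the pattern to whatever your lmms-eval version writes under --output_path.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_sample_logs(output_path: str) -> Dict[str, List[dict]]:
    """Collect per-sample records from JSON/JSONL files under output_path.

    Assumption: sample log files have "samples" in their file name.
    """
    root = Path(output_path)
    logs: Dict[str, List[dict]] = {}
    for path in sorted(root.rglob("*samples*")):
        if not path.is_file():
            continue
        if path.suffix == ".jsonl":
            # JSON Lines: one record per non-empty line
            with path.open(encoding="utf-8") as f:
                records = [json.loads(line) for line in f if line.strip()]
        elif path.suffix == ".json":
            with path.open(encoding="utf-8") as f:
                data = json.load(f)
            # Accept either a list of records or a single object
            records = data if isinstance(data, list) else [data]
        else:
            continue
        # Key each log by its path relative to the output directory
        logs[str(path.relative_to(root))] = records
    return logs
```

For example, load_sample_logs("./test_results") returns a dictionary keyed by relative file path, so each task's records can be examined or diffed between runs.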
Usage Examples
Basic Example: Smoke Test
# Test a custom task with 8 samples, logging outputs
# Run from command line:
# python -m lmms_eval \
# --model llava \
# --model_args pretrained=liuhaotian/llava-v1.5-7b \
# --tasks my_custom_task \
# --limit 8 \
# --log_samples \
# --batch_size 1 \
# --output_path ./test_results
Predict-Only Mode
# Generate outputs without computing metrics
# python -m lmms_eval \
# --model qwen2_5_vl \
# --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
# --tasks my_custom_task \
# --limit 8 \
# --predict_only \
# --batch_size 1 \
# --output_path ./debug_outputs
Full Evaluation with Custom Task Directory
# Run full evaluation including external task definitions
# python -m lmms_eval \
# --model qwen2_5_vl \
# --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056,attn_implementation=sdpa \
# --tasks my_custom_task \
# --batch_size 128 \
# --include_path /path/to/my/custom_tasks \
# --output_path ./results \
# --log_samples
Programmatic Evaluation
from lmms_eval.evaluator import simple_evaluate
results, samples = simple_evaluate(
    model="llava",
    model_args="pretrained=liuhaotian/llava-v1.5-7b",
    tasks=["my_custom_task"],
    limit=8,
    log_samples=True,
    batch_size=1,
    device="cuda:0",
)
# Inspect results
for task_name, task_results in results["results"].items():
    print(f"{task_name}:")
    for metric, value in task_results.items():
        print(f"  {metric}: {value}")
# Inspect logged samples
if samples:
    for task_name, task_samples in samples.items():
        for sample in task_samples[:3]:
            print(f"Prompt: {sample.get('doc', {}).get('question', '')}")
            print(f"Output: {sample.get('resps', '')}")
            print(f"Target: {sample.get('target', '')}")
            print("---")
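A common first check on logged samples is whether the model returned anything at all. The following sketch scans a list of sample dictionaries for blank outputs. The helper names are hypothetical, and the code assumes only that each sample may carry a 'resps' field holding a string or nested lists of strings, as in the example above.

```python
from typing import Any, Iterable, List


def flatten_text(value: Any) -> List[str]:
    """Flatten a string, or arbitrarily nested lists/tuples of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)):
        out: List[str] = []
        for item in value:
            out.extend(flatten_text(item))
        return out
    # Missing field yields nothing; other scalars are stringified
    return [] if value is None else [str(value)]


def find_empty_outputs(samples: Iterable[dict]) -> List[int]:
    """Return positions of samples whose 'resps' field has no non-blank text."""
    empty = []
    for i, sample in enumerate(samples):
        texts = flatten_text(sample.get("resps"))
        if not any(t.strip() for t in texts):
            empty.append(i)
    return empty


demo = [{"resps": [["A"]]}, {"resps": [[""]]}, {"resps": "  "}, {}]
print(find_empty_outputs(demo))  # [1, 2, 3]
```

A non-empty result on a small --limit run usually points at prompt construction or generation parameters rather than at the metric code.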