Implementation:EvolvingLMMs Lab Lmms eval Cli Evaluate Task Test

Knowledge Sources	lmms-eval
Domains	Testing, Evaluation
Last Updated	2026-02-14 00:00 GMT

Overview

A concrete tool, provided by the lmms-eval framework, for validating custom tasks by running limited evaluations with per-sample logging.

Description

The cli_evaluate() function is the main entry point for running evaluations from the command line. It parses arguments, initializes the accelerator for distributed execution, and delegates to simple_evaluate(), which orchestrates the full evaluation pipeline: task loading, model instantiation, request building, inference, and metric computation.

For task testing purposes, three key CLI flags enable rapid validation:

--limit N: Restricts evaluation to the first N examples from the dataset. When N is an integer, exactly N samples are evaluated. When N is a float less than 1, it is interpreted as a fraction of the total dataset (for example, 0.1 selects 10% of the examples). This enables quick smoke tests that exercise the full pipeline without waiting for a complete benchmark evaluation.

--log_samples: Enables per-sample output logging. When active, the framework saves a JSON file containing each document's prompt, the model's raw output, the ground truth target, and the computed per-sample metric values. This file is invaluable for debugging prompt construction and scoring logic.

--predict_only: Runs model inference and logs outputs without computing metrics. This is useful for iterating on prompt design and generation parameters before implementing or debugging metric functions.
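
The --limit semantics described above can be sketched in a few lines. This is an illustrative sketch only: resolve_limit is a hypothetical helper, not a function from lmms-eval, and rounding a fractional limit up is an assumption rather than confirmed framework behavior.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], total: int) -> int:
    """Return how many documents a given --limit value would select.

    Hypothetical helper for illustration; not part of lmms-eval.
    """
    if limit is None:
        return total  # no limit: evaluate the whole dataset
    if limit < 1:
        # Fractional limit: a share of the dataset.
        # Rounding up is an assumption made for this sketch.
        return min(total, math.ceil(total * limit))
    # Integer limit: the first N documents, capped at the dataset size.
    return min(total, int(limit))


print(resolve_limit(8, 200))     # 8
print(resolve_limit(0.25, 200))  # 50
print(resolve_limit(None, 200))  # 200
```

With a 200-example dataset, --limit 8 keeps the smoke test to eight documents, while --limit 0.25 would cover a quarter of the data.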

The simple_evaluate() function handles the orchestration:

1. Initializes the task manager and loads requested tasks
2. Instantiates the model from the specified model name and arguments
3. Builds evaluation requests for each task
4. Runs model inference with batching
5. Applies filters and computes metrics (unless predict_only)
6. Returns results dict and optional sample logs
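
The six orchestration steps listed above can be mirrored in a toy pipeline. The sketch below is not the lmms-eval implementation: toy_evaluate, the document fields ("question", "answer"), and the exact_match metric are all hypothetical stand-ins chosen to show how limit and predict_only shape the flow.

```python
from typing import Callable, Dict, List, Optional, Tuple


def toy_evaluate(
    docs: List[Dict[str, str]],
    model_fn: Callable[[List[str]], List[str]],
    limit: Optional[int] = None,
    batch_size: int = 2,
    predict_only: bool = False,
) -> Tuple[dict, List[dict]]:
    # Steps 1-2: the documents stand in for a loaded task, model_fn for the model.
    selected = docs if limit is None else docs[:limit]
    # Step 3: build one request (here, just the prompt string) per document.
    prompts = [doc["question"] for doc in selected]
    # Step 4: run inference in batches.
    outputs: List[str] = []
    for start in range(0, len(prompts), batch_size):
        outputs.extend(model_fn(prompts[start:start + batch_size]))
    samples = [
        {"doc": doc, "resps": out, "target": doc["answer"]}
        for doc, out in zip(selected, outputs)
    ]
    # Step 5: compute metrics unless predict_only is set.
    if predict_only or not samples:
        return {"results": {}}, samples
    exact = sum(s["resps"].strip() == s["target"] for s in samples) / len(samples)
    # Step 6: return aggregate results plus per-sample records.
    return {"results": {"toy_task": {"exact_match": exact}}}, samples


docs = [
    {"question": "2+2?", "answer": "4"},
    {"question": "3+3?", "answer": "6"},
    {"question": "5+5?", "answer": "10"},
]
always_four = lambda prompts: ["4" for _ in prompts]  # stub model
results, samples = toy_evaluate(docs, always_four, limit=2)
print(results["results"])  # {'toy_task': {'exact_match': 0.5}}
```

Passing predict_only=True to the same call would return the per-sample records with an empty results dictionary, which is the behavior that makes prompt debugging possible before any metric exists.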

After evaluation, cli_evaluate() prints a formatted results table using make_table() and optionally logs to Weights & Biases.

Usage

Use this to validate your custom task after creating the YAML configuration and utility functions. Run with --limit 8 --log_samples for initial testing, then without --limit for full evaluation. Use --predict_only when debugging prompts independently of metrics.
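
One way to keep the smoke-test and full-run invocations in sync is to assemble both from a single helper. The sketch below only builds argument lists and uses flags already described on this page; build_command is a hypothetical helper, not part of lmms-eval, and nothing is executed unless the list is handed to subprocess.run.

```python
import sys
from typing import List, Optional


def build_command(
    model: str,
    model_args: str,
    tasks: str,
    limit: Optional[int] = None,
    log_samples: bool = False,
    predict_only: bool = False,
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `python -m lmms_eval` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "lmms_eval",
           "--model", model, "--model_args", model_args, "--tasks", tasks]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if log_samples:
        cmd.append("--log_samples")
    if predict_only:
        cmd.append("--predict_only")
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


# Initial testing: 8 samples with per-sample logs.
smoke = build_command("llava", "pretrained=liuhaotian/llava-v1.5-7b", "my_custom_task",
                      limit=8, log_samples=True, output_path="./test_results")
# Full evaluation: same arguments, no --limit.
full = build_command("llava", "pretrained=liuhaotian/llava-v1.5-7b", "my_custom_task",
                     log_samples=True, output_path="./results")
print(" ".join(smoke[1:]))
# To actually run it: subprocess.run(smoke, check=True)
```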

Code Reference

Source Location

Repository: lmms-eval
File: lmms_eval/__main__.py (lines 445-544 for cli_evaluate), lmms_eval/evaluator.py (lines 51-150 for simple_evaluate signature)

Signature

# CLI entry point
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:


# Core evaluation function
@positional_deprecated
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict]] = None,
    tasks: Optional[List[Union[str, dict, object]]] = None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
    device: Optional[str] = None,
    limit: Optional[Union[int, float]] = None,
    offset: int = 0,
    log_samples: bool = True,
    predict_only: bool = False,
    gen_kwargs: Optional[str] = None,
    task_manager: Optional[TaskManager] = None,
    verbosity: str = "INFO",
    # ... additional parameters
) -> Tuple[dict, Optional[dict]]:

Import

from lmms_eval.__main__ import cli_evaluate
from lmms_eval.evaluator import simple_evaluate

I/O Contract

Inputs

Name	Type	Required	Description
--model	`str`	Yes	Name of the model to evaluate (e.g., `"llava"`, `"qwen2_5_vl"`, `"gpt4v"`).
--model_args	`str`	Yes	Comma-separated key=value arguments for model construction (e.g., `"pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056"`).
--tasks	`str`	Yes	Comma-separated list of task names to evaluate (e.g., `"mme"`, `"mmmu,mme"`).
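--limit	`Optional[Union[int, float]]`	No	Number of examples to evaluate per task, or a fraction of the dataset when given as a float less than 1. Use for testing (e.g., `8`).
--log_samples	`bool`	No	If set, enables per-sample output logging to JSON files. The `log_samples` parameter of `simple_evaluate()` defaults to `True`.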
--predict_only	`bool`	No	If set, generates model outputs without computing metrics.
--batch_size	`Union[int, str]`	No	Batch size for model inference. Use `"auto"` for automatic detection.
--device	`Optional[str]`	No	PyTorch device for model execution (e.g., `"cuda:0"`).
--output_path	`Optional[str]`	No	Directory to save results and sample logs.
--include_path	`Optional[str]`	No	Additional directory to search for task definitions.
--verbosity	`str`	No	Logging verbosity level (default: `"INFO"`). Set to `"DEBUG"` for detailed error traces.
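
To make the comma-separated key=value format of --model_args concrete, the sketch below splits such a string into a dictionary. parse_model_args is a hypothetical helper written for illustration; the framework has its own parser, which may also coerce value types, whereas this sketch keeps every value as a string.

```python
from typing import Dict


def parse_model_args(arg_string: str) -> Dict[str, str]:
    """Split "k1=v1,k2=v2" into a dict (hypothetical helper; values stay strings)."""
    parsed: Dict[str, str] = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value
    return parsed


args = parse_model_args("pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056")
print(args["pretrained"])  # Qwen/Qwen2.5-VL-3B-Instruct
print(args["max_pixels"])  # 12845056 (still a string in this sketch)
```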

Outputs

Name	Type	Description
Results table	`stdout`	A formatted table printed to stdout showing metric scores for each task and group.
Results JSON	`file`	A JSON file (when `--output_path` is specified) containing the full results dictionary with task configs, metric scores, and metadata.
Sample logs	`file`	JSON files (when `--log_samples` is active) containing per-sample prompts, model outputs, targets, and metric values.
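
Once sample logs are on disk, they can be loaded for inspection with the standard library. The sketch below is hedged on the file layout: the exact file names and whether a file holds a JSON array or one JSON object per line are assumptions, so load_samples (a hypothetical helper) accepts both. The demo file and its "target" and "resps" fields are made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_samples(path: Path) -> List[dict]:
    """Load per-sample records from a JSON or JSON Lines file (hypothetical helper)."""
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return []
    try:
        data = json.loads(text)  # whole file is one JSON value
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


# Demo with a temporary file standing in for a real sample log.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "samples_demo.jsonl"
    demo.write_text(
        '{"target": "4", "resps": "4"}\n{"target": "6", "resps": "4"}\n',
        encoding="utf-8",
    )
    loaded = load_samples(demo)

print(len(loaded))  # 2
```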

Usage Examples

Basic Example: Smoke Test

# Test a custom task with 8 samples, logging outputs
# Run from command line:
# python -m lmms_eval \
#     --model llava \
#     --model_args pretrained=liuhaotian/llava-v1.5-7b \
#     --tasks my_custom_task \
#     --limit 8 \
#     --log_samples \
#     --batch_size 1 \
#     --output_path ./test_results

Predict-Only Mode

# Generate outputs without computing metrics
# python -m lmms_eval \
#     --model qwen2_5_vl \
#     --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
#     --tasks my_custom_task \
#     --limit 8 \
#     --predict_only \
#     --batch_size 1 \
#     --output_path ./debug_outputs

Full Evaluation with Custom Task Directory

# Run full evaluation including external task definitions
# python -m lmms_eval \
#     --model qwen2_5_vl \
#     --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056,attn_implementation=sdpa \
#     --tasks my_custom_task \
#     --batch_size 128 \
#     --include_path /path/to/my/custom_tasks \
#     --output_path ./results \
#     --log_samples

Programmatic Evaluation

from lmms_eval.evaluator import simple_evaluate

results, samples = simple_evaluate(
    model="llava",
    model_args="pretrained=liuhaotian/llava-v1.5-7b",
    tasks=["my_custom_task"],
    limit=8,
    log_samples=True,
    batch_size=1,
    device="cuda:0",
)

# Inspect results
for task_name, task_results in results["results"].items():
    print(f"{task_name}:")
    for metric, value in task_results.items():
        print(f"  {metric}: {value}")

# Inspect logged samples
if samples:
    for task_name, task_samples in samples.items():
        for sample in task_samples[:3]:
            print(f"Prompt: {sample.get('doc', {}).get('question', '')}")
            print(f"Output: {sample.get('resps', '')}")
            print(f"Target: {sample.get('target', '')}")
            print("---")
