Implementation:EvolvingLMMs Lab Lmms eval Cli Evaluate Task Test

From Leeroopedia
Knowledge Sources
Domains Testing, Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tool, provided by the lmms-eval framework, for validating custom tasks by running limited evaluations with per-sample logging.

Description

The cli_evaluate() function is the main entry point for running evaluations from the command line. It parses arguments, initializes the accelerator for distributed execution, and delegates to simple_evaluate() which orchestrates the full evaluation pipeline: task loading, model instantiation, request building, inference, and metric computation.

For task testing purposes, three key CLI flags enable rapid validation:

--limit N: Restricts evaluation to the first N examples from the dataset. When N is an integer, exactly N samples are evaluated. When N is a float less than 1, it is interpreted as a fraction of the total dataset (for example, 0.1 selects 10% of the examples). This enables quick smoke tests that exercise the full pipeline without waiting for a complete benchmark run.

--log_samples: Enables per-sample output logging. When active, the framework saves a JSON file containing each document's prompt, the model's raw output, the ground truth target, and the computed per-sample metric values. This file is invaluable for debugging prompt construction and scoring logic.

--predict_only: Runs model inference and logs outputs without computing metrics. This is useful for iterating on prompt design and generation parameters before implementing or debugging metric functions.
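
The --limit rule described above can be sketched as a small helper. This is an illustrative sketch rather than the framework's own code: resolve_limit is a hypothetical name, and both rounding fractional limits up and capping the result at the dataset size are assumptions.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], total: int) -> int:
    """Return how many of `total` examples to evaluate (hypothetical helper)."""
    if limit is None:
        return total  # no limit: evaluate the whole dataset
    if limit < 1:
        # Fractional limit: a share of the dataset (rounding up is an assumption).
        return min(total, math.ceil(total * limit))
    # Integer limit: the first N examples, capped at the dataset size (assumption).
    return min(total, int(limit))
```

For instance, under these assumptions resolve_limit(8, 100) selects 8 examples and resolve_limit(0.25, 200) selects 50.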

The simple_evaluate() function handles the orchestration:

  1. Initializes the task manager and loads requested tasks
  2. Instantiates the model from the specified model name and arguments
  3. Builds evaluation requests for each task
  4. Runs model inference with batching
  5. Applies filters and computes metrics (unless predict_only)
  6. Returns results dict and optional sample logs
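
The order of these steps can be illustrated with a toy, self-contained pipeline. It is a sketch of the control flow only and does not use lmms-eval: run_pipeline, the document keys "question" and "answer", and the exact-match metric are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Optional


def run_pipeline(
    docs: List[dict],
    model: Callable[[List[str]], List[str]],
    batch_size: int = 2,
    limit: Optional[int] = None,
    predict_only: bool = False,
) -> Dict[str, object]:
    """Toy stand-in for the evaluation flow; every name here is hypothetical."""
    docs = docs[:limit] if limit is not None else docs
    # Build one request (here just the prompt string) per document.
    requests = [doc["question"] for doc in docs]
    # Run inference batch by batch.
    outputs: List[str] = []
    for start in range(0, len(requests), batch_size):
        outputs.extend(model(requests[start:start + batch_size]))
    # Keep per-sample records, similar in spirit to what sample logging saves.
    samples = [
        {"doc": doc, "resps": out, "target": doc["answer"]}
        for doc, out in zip(docs, outputs)
    ]
    if predict_only:
        return {"results": {}, "samples": samples}  # skip scoring entirely
    # Score with exact match, a deliberately simple metric.
    correct = sum(s["resps"].strip() == s["target"] for s in samples)
    accuracy = correct / len(samples) if samples else 0.0
    return {"results": {"exact_match": accuracy}, "samples": samples}
```

Passing a trivial callable as the model (for example, one that always answers "2") is enough to exercise the limit, batching, and predict-only branches without loading any weights.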

After evaluation, cli_evaluate() prints a formatted results table using make_table() and optionally logs to Weights & Biases.

Usage

Use this to validate your custom task after creating the YAML configuration and utility functions. Run with --limit 8 --log_samples for initial testing, then without --limit for full evaluation. Use --predict_only when debugging prompts independently of metrics.
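
When this workflow is scripted, the command line can be assembled in Python. The sketch below uses only flags documented on this page; build_eval_command is a hypothetical helper, and fixing --batch_size to 1 is an assumption suited to small test runs.

```python
import sys
from typing import List, Optional


def build_eval_command(
    model: str,
    model_args: str,
    task: str,
    output_path: str,
    limit: Optional[int] = 8,
    predict_only: bool = False,
) -> List[str]:
    """Assemble a `python -m lmms_eval` argument list (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", task,
        "--batch_size", "1",
        "--output_path", output_path,
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]  # omit the flag for a full evaluation
    # Smoke tests log samples; prompt debugging uses predict-only mode.
    cmd.append("--predict_only" if predict_only else "--log_samples")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True); calling the helper with limit=None yields the full-evaluation variant.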

Code Reference

Source Location

  • Repository: lmms-eval
  • File: lmms_eval/__main__.py (lines 445-544 for cli_evaluate), lmms_eval/evaluator.py (lines 51-150 for simple_evaluate signature)

Signature

# CLI entry point
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:


# Core evaluation function
@positional_deprecated
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict]] = None,
    tasks: Optional[List[Union[str, dict, object]]] = None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
    device: Optional[str] = None,
    limit: Optional[Union[int, float]] = None,
    offset: int = 0,
    log_samples: bool = True,
    predict_only: bool = False,
    gen_kwargs: Optional[str] = None,
    task_manager: Optional[TaskManager] = None,
    verbosity: str = "INFO",
    # ... additional parameters
) -> Tuple[dict, Optional[dict]]:

Import

from lmms_eval.__main__ import cli_evaluate
from lmms_eval.evaluator import simple_evaluate

I/O Contract

Inputs

Name | Type | Required | Description
--model | str | Yes | Name of the model to evaluate (e.g., "llava", "qwen2_5_vl", "gpt4v").
--model_args | str | Yes | Comma-separated key=value arguments for model construction (e.g., "pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056").
--tasks | str | Yes | Comma-separated list of task names to evaluate (e.g., "mme", "mmmu,mme").
--limit | Optional[Union[int, float]] | No | Number of examples to evaluate per task, or a fraction of the dataset when less than 1. Use for testing (e.g., 8).
--log_samples | bool | No | If set, enables per-sample output logging to JSON files. As a CLI flag it is off unless passed; the simple_evaluate() signature above defaults log_samples to True.
--predict_only | bool | No | If set, generates model outputs without computing metrics.
--batch_size | Union[int, str] | No | Batch size for model inference. Use "auto" for automatic detection.
--device | Optional[str] | No | PyTorch device for model execution (e.g., "cuda:0").
--output_path | Optional[str] | No | Directory to save results and sample logs.
--include_path | Optional[str] | No | Additional directory to search for task definitions.
--verbosity | str | No | Logging verbosity level (default: "INFO"). Set to "DEBUG" for detailed error traces.
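
The comma-separated key=value format used by --model_args can be illustrated with a short parser. This is a minimal sketch and not the framework's own parsing code: parse_model_args is a hypothetical name, and keeping every value as a string is an assumption.

```python
from typing import Dict


def parse_model_args(arg_string: str) -> Dict[str, str]:
    """Split "k1=v1,k2=v2" into a dict of strings (hypothetical helper)."""
    parsed: Dict[str, str] = {}
    for item in arg_string.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty strings and trailing commas
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = value
    return parsed
```

Values such as max_pixels stay as strings here; any conversion to int or bool is left to the caller.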

Outputs

Name | Type | Description
Results table | stdout | A formatted table printed to stdout showing metric scores for each task and group.
Results JSON | file | A JSON file (when --output_path is specified) containing the full results dictionary with task configs, metric scores, and metadata.
Sample logs | file | JSON files (when --log_samples is active) containing per-sample prompts, model outputs, targets, and metric values.
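
Once a sample log has been written, it can be loaded for inspection with a few lines of Python. The sketch below makes assumptions: load_samples is a hypothetical helper, the log path is supplied by the caller because the file naming scheme is not specified here, and the JSON Lines fallback is a precaution in case records are stored one object per line.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_samples(path: str) -> List[Dict[str, Any]]:
    """Read sample records from a JSON array or a JSON Lines file."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: fall back to one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single object is wrapped so callers always receive a list.
    return data if isinstance(data, list) else [data]
```

Each returned record can then be examined field by field, for example by comparing its logged model output against its target.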

Usage Examples

Basic Example: Smoke Test

# Test a custom task with 8 samples, logging outputs.
# Run from the command line:
python -m lmms_eval \
    --model llava \
    --model_args pretrained=liuhaotian/llava-v1.5-7b \
    --tasks my_custom_task \
    --limit 8 \
    --log_samples \
    --batch_size 1 \
    --output_path ./test_results

Predict-Only Mode

# Generate outputs without computing metrics
python -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks my_custom_task \
    --limit 8 \
    --predict_only \
    --batch_size 1 \
    --output_path ./debug_outputs

Full Evaluation with Custom Task Directory

# Run full evaluation including external task definitions
python -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_pixels=12845056,attn_implementation=sdpa \
    --tasks my_custom_task \
    --batch_size 128 \
    --include_path /path/to/my/custom_tasks \
    --output_path ./results \
    --log_samples

Programmatic Evaluation

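from lmms_eval.evaluator import simple_evaluate

out = simple_evaluate(
    model="llava",
    model_args="pretrained=liuhaotian/llava-v1.5-7b",
    tasks=["my_custom_task"],
    limit=8,
    log_samples=True,
    batch_size=1,
    device="cuda:0",
)

# The signature above returns (results, samples). As an assumption to verify
# against your installed version: some releases return a single dict that
# carries the samples under a "samples" key, so accept both shapes.
if isinstance(out, tuple):
    results, samples = out
else:
    results, samples = out, (out or {}).get("samples")

# Inspect results
for task_name, task_results in (results or {}).get("results", {}).items():
    print(f"{task_name}:")
    for metric, value in task_results.items():
        print(f"  {metric}: {value}")

# Inspect the first three logged samples of each task
if samples:
    for task_name, task_samples in samples.items():
        for sample in task_samples[:3]:
            print(f"Prompt: {sample.get('doc', {}).get('question', '')}")
            print(f"Output: {sample.get('resps', '')}")
            print(f"Target: {sample.get('target', '')}")
            print("---")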

Related Pages

Implements Principle
