Principle:EvolvingLMMs Lab Lmms eval Example Script Creation

Knowledge Sources	lmms-eval
Domains	Testing, Evaluation
Last Updated	2026-02-14 00:00 GMT

Overview

A reproducible evaluation script records the complete invocation recipe for a model, so that testing and benchmarking runs can be repeated with consistent results.

Description

After you implement and register a custom model, write a shell script that documents the exact evaluation command. This is critical for reproducibility and team collaboration. The lmms-eval framework provides a standard invocation pattern that combines the HuggingFace Accelerate launcher with the framework's own CLI arguments.

The evaluation script pattern has three key components:

1. Distributed Launcher: The accelerate launch command handles multi-GPU distribution. The --num_processes flag controls how many GPU processes are spawned, and --main_process_port sets the communication port. For single-GPU evaluation, --num_processes=1 is sufficient. Alternatively, python -m lmms_eval can be used directly for single-process execution.

2. Model Configuration: The --model flag specifies the registered model name, and --model_args provides a comma-separated list of key=value pairs passed to the model constructor. Common arguments include:

pretrained=<path_or_hub_id> -- the model checkpoint.
max_pixels=<N> -- maximum pixel count for image resizing.
attn_implementation=<type> -- attention implementation (e.g., flash_attention_2, sdpa).

3. Task and Batch Settings: The --tasks flag accepts comma-separated task names, and --batch_size controls how many samples are processed together. Task names can be individual benchmarks (e.g., mme) or task groups (e.g., mmmu).
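The model and task components above can be sketched as plain string composition. This is a minimal illustration, not a script from the lmms-eval repository: the model name my_model, the checkpoint org/my-checkpoint, and the max_pixels value are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: model name, checkpoint ID, and pixel budget are hypothetical placeholders.
MODEL="my_model"
PRETRAINED="org/my-checkpoint"
MAX_PIXELS=1003520
ATTN_IMPL="sdpa"

# --model_args is one comma-separated string of key=value pairs, with no spaces.
MODEL_ARGS="pretrained=${PRETRAINED},max_pixels=${MAX_PIXELS},attn_implementation=${ATTN_IMPL}"

# --tasks is comma-separated as well; mme and mmmu are the examples named above.
TASKS="mme,mmmu"
BATCH_SIZE=1

# Print the argument fragment that would follow "-m lmms_eval" on the command line.
EVAL_ARGS="--model ${MODEL} --model_args=${MODEL_ARGS} --tasks ${TASKS} --batch_size ${BATCH_SIZE}"
printf '%s\n' "$EVAL_ARGS"
```

Keeping each value in its own variable makes the script easier to diff when only the checkpoint or the task list changes.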

The script pattern also supports environment variable configuration, particularly HF_HOME for controlling the HuggingFace cache location, which is important in shared compute environments.
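One way to handle HF_HOME at the top of a script is to respect a value the caller has already set and fall back to a default otherwise. The sketch below assumes a per-user fallback under the home directory; on a shared cluster a scratch or project path is likely a better choice.

```shell
#!/bin/sh
# Sketch: keep HF_HOME if the caller set it, otherwise use a per-user cache path.
# The fallback path is an assumption; adjust it for your environment.
HF_HOME="${HF_HOME:-${HOME:-/tmp}/.cache/huggingface}"
export HF_HOME

# Log the location so the run output records which cache was used.
echo "HuggingFace cache: ${HF_HOME}"
```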

Usage

Create an evaluation shell script when:

You have completed a model integration and want to document the reference evaluation command.
You need to share an exact reproduction recipe with collaborators or in a pull request.
You want to automate evaluation runs in CI/CD or batch job systems.

Place scripts in the examples/models/ directory following the naming convention <model_name>.sh.

Theoretical Basis

A well-structured evaluation script captures every parameter needed to reproduce a run:

Evaluation Script = {
    Environment:  HF_HOME, CUDA_VISIBLE_DEVICES, etc.
    Launcher:     accelerate launch --num_processes=N --main_process_port=P
    Entry Point:  -m lmms_eval
    Model Config: --model <name> --model_args=<key=value,...>
    Task Config:  --tasks <task1,task2,...>
    Batch Config: --batch_size N
    Optional:     --limit N, --log_samples, --output_path <path>
}
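The schema above could translate into a script along the following lines. This is a hedged sketch rather than a file from the repository: my_model, org/my-checkpoint, the port, and the output directory are hypothetical, and the DRY_RUN and LIMIT variables are conveniences introduced here for illustration. By default the script only prints the command; set DRY_RUN=0 to execute it.

```shell
#!/bin/sh
# Hypothetical examples/models/my_model.sh (all names and values are placeholders).

# Environment
export HF_HOME="${HF_HOME:-${HOME:-/tmp}/.cache/huggingface}"
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"

NUM_PROCESSES=1
PORT=12345
OUTPUT_PATH="./logs/my_model"

# Launcher, entry point, model config, task config, batch config.
# The command is built in the positional parameters so quoting is preserved.
set -- accelerate launch --num_processes="$NUM_PROCESSES" --main_process_port="$PORT" \
    -m lmms_eval \
    --model my_model \
    --model_args=pretrained=org/my-checkpoint \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --output_path "$OUTPUT_PATH"

# Optional: LIMIT=8 appends "--limit 8" for a quick smoke test.
if [ -n "${LIMIT:-}" ]; then
    set -- "$@" --limit "$LIMIT"
fi

CMD_LINE="$*"
if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show the command without executing it.
    printf '%s\n' "$CMD_LINE"
else
    "$@"
fi
```

Running with a small --limit first is a cheap way to confirm that the model loads and the task resolves before committing to a full benchmark run.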

The accelerate launch command internally manages:

Process spawning across available GPUs.
Rank assignment and communication backend initialization.
Environment variable propagation to child processes.
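Because the launcher spawns one process per requested GPU, a script can derive --num_processes from CUDA_VISIBLE_DEVICES instead of hard-coding it. The sketch below assumes the variable holds a comma-separated list of device IDs; count_gpus is a hypothetical helper, and the default of two devices is only for illustration.

```shell
#!/bin/sh
# Sketch: spawn one process per GPU listed in CUDA_VISIBLE_DEVICES.
# Assumes a comma-separated list of device IDs such as "0,1,2,3".
CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1}"
export CUDA_VISIBLE_DEVICES

# Hypothetical helper: count the comma-separated entries in its first argument.
count_gpus() {
    old_ifs=$IFS
    IFS=,
    set -- $1          # split on commas into the function's positional parameters
    IFS=$old_ifs
    echo "$#"
}

NUM_PROCESSES=$(count_gpus "$CUDA_VISIBLE_DEVICES")

# Show the launcher prefix that the rest of the evaluation command would follow.
echo "accelerate launch --num_processes=${NUM_PROCESSES} --main_process_port=12345 -m lmms_eval"
```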

The framework then handles:

Parsing --model_args via simple_parse_args_string().
Model class resolution via ModelRegistryV2.
Model instantiation via create_from_arg_string().
Task building, request dispatching, and result aggregation.
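To make the --model_args format concrete, the following shell sketch splits a comma-separated string into key/value pairs. It only illustrates the string format; it is not the framework's simple_parse_args_string() implementation, and parse_model_args and the sample values are hypothetical.

```shell
#!/bin/sh
# Illustration of the key=value,key=value format accepted by --model_args.
# Not the framework's parser; the checkpoint and values are placeholders.
MODEL_ARGS="pretrained=org/my-checkpoint,max_pixels=1003520,attn_implementation=sdpa"

parse_model_args() {
    old_ifs=$IFS
    IFS=,
    for pair in $1; do
        key=${pair%%=*}      # text before the first "="
        value=${pair#*=}     # text after the first "="
        printf '%s -> %s\n' "$key" "$value"
    done
    IFS=$old_ifs
}

parse_model_args "$MODEL_ARGS"
# prints:
# pretrained -> org/my-checkpoint
# max_pixels -> 1003520
# attn_implementation -> sdpa
```

Each resulting pair corresponds to one keyword argument of the model constructor, which is why a typo in a key typically surfaces as a constructor error rather than a CLI error.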

For multi-GPU evaluation, data is distributed across ranks using the task's build_all_requests(rank, world_size) method, and results are gathered back to rank 0 using torch.distributed.gather_object.

Related Pages

Implemented By

Implementation:EvolvingLMMs_Lab_Lmms_eval_Evaluation_Shell_Script
