Workflow:EvolvingLMMs Lab Lmms eval Custom Model Integration

Knowledge Sources	lmms-eval Model Guide
Domains	LLMs, Multimodal_Evaluation, Model_Integration
Last Updated	2026-02-14 00:00 GMT

Overview

End-to-end process for integrating a new multimodal model into the lmms-eval evaluation framework, from implementing the model wrapper class through registry registration and evaluation testing.

Description

This workflow covers adding support for a new model architecture to lmms-eval. All models implement a wrapper class that subclasses the abstract lmms base class, providing generate_until() for open-ended generation tasks and loglikelihood() for multiple-choice scoring. The framework supports two model interfaces: chat models (recommended, using structured doc_to_messages) and simple/legacy models (using doc_to_visual + doc_to_text with image placeholder tokens). Models are registered via the ModelRegistryV2 and discovered by the --model CLI argument.

Usage

Execute this workflow when you need to evaluate a model architecture that is not yet supported by lmms-eval. This applies when adding support for a new open-source model family, integrating a proprietary API-based model, or wrapping an inference server (vLLM, SGLang) for high-throughput evaluation.

Execution Steps

Step 1: Model Type Selection

Determine whether the model should use the chat interface (recommended) or the simple/legacy interface. Chat models receive structured messages with roles and typed content (text, images, video, audio) via the ChatMessages protocol. Simple models receive a plain text prompt with image placeholder tokens and a separate visual input list. New models should use the chat interface unless there is a specific reason to use the legacy path.

Key considerations:

Chat models set is_simple = False and receive doc_to_messages in Instance.args
Simple models set is_simple = True and receive doc_to_visual + doc_to_text separately
The evaluator automatically routes tasks to the appropriate interface based on the model's is_simple flag
Reference implementations: qwen2_5_vl.py (chat), instructblip.py (simple)

Step 2: Model Wrapper Implementation

Create a new Python file in lmms_eval/models/chat/ (for chat models) or lmms_eval/models/simple/ (for legacy models). The class must subclass lmms_eval.api.model.lmms and implement two core methods: generate_until() for open-ended text generation given multimodal prompts, and loglikelihood() for computing log-probabilities of target continuations in multiple-choice tasks.

Key considerations:

generate_until receives Instance objects whose args contain the prompt construction function, generation kwargs, doc_id, task name, and split
loglikelihood receives Instance objects for scoring candidate answers against contexts
Handle model loading, tokenization, and device management in __init__
Implement proper batching for throughput (respecting self.batch_size)
Use self.task_dict to access the dataset for retrieving documents by doc_id

Step 3: Media Handling

Implement multimodal input processing for the model's supported modalities. For chat models, use the ChatMessages protocol to extract images, videos, and audio from structured messages. Convert media into the format expected by the model's processor (PIL images, video tensors, audio waveforms). Handle mixed-modality inputs and variable numbers of media items per request.

Key considerations:

ChatMessages.extract_media() returns separate lists for images, videos, and audios
ChatMessages.to_hf_messages() converts to the HuggingFace message format for tokenizer chat templates
Video inputs may arrive as file paths requiring frame extraction
Audio inputs may require resampling to the model's expected sample rate

Step 4: Registry Registration

Register the model so lmms-eval can discover it via the --model CLI argument. Use the @register_model decorator on the class, then add a ModelManifest entry in lmms_eval/models/__init__.py. The manifest maps the model_id string to the class import path, specifying whether it supports chat_class_path, simple_class_path, or both.

Key considerations:

The model_id in @register_model must match the --model CLI argument
Add to AVAILABLE_CHAT_TEMPLATE_MODELS (chat) or AVAILABLE_SIMPLE_MODELS (simple) dictionaries
The ModelRegistryV2 also supports entry-point-based plugin registration for external packages
Aliases can be defined in the ModelManifest for alternative model names

Step 5: Example Script Creation

Create an example shell script in examples/models/ demonstrating how to run the model with typical evaluation tasks. The script should show representative --model_args, recommended --batch_size, and commonly used --tasks for the model's target modalities. Include comments explaining model-specific parameters.

Key considerations:

Show both minimal and full evaluation examples
Document any environment variables or prerequisites (API keys, model downloads)
Include recommended generation kwargs if they differ from defaults
Test the script end-to-end before submitting

Step 6: Testing and Validation

Validate the model integration by running evaluations on representative tasks. Test with a small --limit first, then verify results match expected baselines. Check that both generate_until and loglikelihood work correctly if the model supports both. Test with batch_size > 1 to verify batching logic, and in multi-GPU mode to verify distributed compatibility.

Key considerations:

Test with at least one generation task (e.g., mme) and one multiple-choice task (e.g., seedbench_ppl)
Use --log_samples to inspect model outputs for correctness
Compare results against published baselines for the model
Verify memory usage and throughput are reasonable

Execution Diagram

GitHub URL

Workflow Repository