Workflow:EvolvingLMMs Lab Lmms eval Custom Model Integration
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal_Evaluation, Model_Integration |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for integrating a new multimodal model into the lmms-eval evaluation framework, from implementing the model wrapper class through registry registration and evaluation testing.
Description
This workflow covers adding support for a new model architecture to lmms-eval. All models implement a wrapper class that subclasses the abstract lmms base class, providing generate_until() for open-ended generation tasks and loglikelihood() for multiple-choice scoring. The framework supports two model interfaces: chat models (recommended, using structured doc_to_messages) and simple/legacy models (using doc_to_visual + doc_to_text with image placeholder tokens). Models are registered via the ModelRegistryV2 and discovered by the --model CLI argument.
Usage
Execute this workflow when you need to evaluate a model architecture that is not yet supported by lmms-eval. This applies when adding support for a new open-source model family, integrating a proprietary API-based model, or wrapping an inference server (vLLM, SGLang) for high-throughput evaluation.
Execution Steps
Step 1: Model Type Selection
Determine whether the model should use the chat interface (recommended) or the simple/legacy interface. Chat models receive structured messages with roles and typed content (text, images, video, audio) via the ChatMessages protocol. Simple models receive a plain text prompt with image placeholder tokens and a separate visual input list. New models should use the chat interface unless there is a specific reason to use the legacy path.
Key considerations:
- Chat models set is_simple = False and receive doc_to_messages in Instance.args
- Simple models set is_simple = True and receive doc_to_visual + doc_to_text separately
- The evaluator automatically routes tasks to the appropriate interface based on the model's is_simple flag
- Reference implementations: qwen2_5_vl.py (chat), instructblip.py (simple)
Step 2: Model Wrapper Implementation
Create a new Python file in lmms_eval/models/chat/ (for chat models) or lmms_eval/models/simple/ (for legacy models). The class must subclass lmms_eval.api.model.lmms and implement two core methods: generate_until() for open-ended text generation given multimodal prompts, and loglikelihood() for computing log-probabilities of target continuations in multiple-choice tasks.
Key considerations:
- generate_until receives Instance objects whose args contain the prompt construction function, generation kwargs, doc_id, task name, and split
- loglikelihood receives Instance objects for scoring candidate answers against contexts
- Handle model loading, tokenization, and device management in __init__
- Implement proper batching for throughput (respecting self.batch_size)
- Use self.task_dict to access the dataset for retrieving documents by doc_id
Step 3: Media Handling
Implement multimodal input processing for the model's supported modalities. For chat models, use the ChatMessages protocol to extract images, videos, and audio from structured messages. Convert media into the format expected by the model's processor (PIL images, video tensors, audio waveforms). Handle mixed-modality inputs and variable numbers of media items per request.
Key considerations:
- ChatMessages.extract_media() returns separate lists for images, videos, and audios
- ChatMessages.to_hf_messages() converts to the HuggingFace message format for tokenizer chat templates
- Video inputs may arrive as file paths requiring frame extraction
- Audio inputs may require resampling to the model's expected sample rate
Step 4: Registry Registration
Register the model so lmms-eval can discover it via the --model CLI argument. Use the @register_model decorator on the class, then add a ModelManifest entry in lmms_eval/models/__init__.py. The manifest maps the model_id string to the class import path, specifying whether it supports chat_class_path, simple_class_path, or both.
Key considerations:
- The model_id in @register_model must match the --model CLI argument
- Add to AVAILABLE_CHAT_TEMPLATE_MODELS (chat) or AVAILABLE_SIMPLE_MODELS (simple) dictionaries
- The ModelRegistryV2 also supports entry-point-based plugin registration for external packages
- Aliases can be defined in the ModelManifest for alternative model names
Step 5: Example Script Creation
Create an example shell script in examples/models/ demonstrating how to run the model with typical evaluation tasks. The script should show representative --model_args, recommended --batch_size, and commonly used --tasks for the model's target modalities. Include comments explaining model-specific parameters.
Key considerations:
- Show both minimal and full evaluation examples
- Document any environment variables or prerequisites (API keys, model downloads)
- Include recommended generation kwargs if they differ from defaults
- Test the script end-to-end before submitting
Step 6: Testing and Validation
Validate the model integration by running evaluations on representative tasks. Test with a small --limit first, then verify results match expected baselines. Check that both generate_until and loglikelihood work correctly if the model supports both. Test with batch_size > 1 to verify batching logic, and in multi-GPU mode to verify distributed compatibility.
Key considerations:
- Test with at least one generation task (e.g., mme) and one multiple-choice task (e.g., seedbench_ppl)
- Use --log_samples to inspect model outputs for correctness
- Compare results against published baselines for the model
- Verify memory usage and throughput are reasonable