Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:EvolvingLMMs Lab Lmms eval Custom Model Integration

From Leeroopedia
Knowledge Sources
Domains LLMs, Multimodal_Evaluation, Model_Integration
Last Updated 2026-02-14 00:00 GMT

Overview

End-to-end process for integrating a new multimodal model into the lmms-eval evaluation framework, from implementing the model wrapper class through registry registration and evaluation testing.

Description

This workflow covers adding support for a new model architecture to lmms-eval. All models implement a wrapper class that subclasses the abstract lmms base class, providing generate_until() for open-ended generation tasks and loglikelihood() for multiple-choice scoring. The framework supports two model interfaces: chat models (recommended, using structured doc_to_messages) and simple/legacy models (using doc_to_visual + doc_to_text with image placeholder tokens). Models are registered via the ModelRegistryV2 and discovered by the --model CLI argument.

Usage

Execute this workflow when you need to evaluate a model architecture that is not yet supported by lmms-eval. This applies when adding support for a new open-source model family, integrating a proprietary API-based model, or wrapping an inference server (vLLM, SGLang) for high-throughput evaluation.

Execution Steps

Step 1: Model Type Selection

Determine whether the model should use the chat interface (recommended) or the simple/legacy interface. Chat models receive structured messages with roles and typed content (text, images, video, audio) via the ChatMessages protocol. Simple models receive a plain text prompt with image placeholder tokens and a separate visual input list. New models should use the chat interface unless there is a specific reason to use the legacy path.

Key considerations:

  • Chat models set is_simple = False and receive doc_to_messages in Instance.args
  • Simple models set is_simple = True and receive doc_to_visual + doc_to_text separately
  • The evaluator automatically routes tasks to the appropriate interface based on the model's is_simple flag
  • Reference implementations: qwen2_5_vl.py (chat), instructblip.py (simple)

Step 2: Model Wrapper Implementation

Create a new Python file in lmms_eval/models/chat/ (for chat models) or lmms_eval/models/simple/ (for legacy models). The class must subclass lmms_eval.api.model.lmms and implement two core methods: generate_until() for open-ended text generation given multimodal prompts, and loglikelihood() for computing log-probabilities of target continuations in multiple-choice tasks.

Key considerations:

  • generate_until receives Instance objects whose args contain the prompt construction function, generation kwargs, doc_id, task name, and split
  • loglikelihood receives Instance objects for scoring candidate answers against contexts
  • Handle model loading, tokenization, and device management in __init__
  • Implement proper batching for throughput (respecting self.batch_size)
  • Use self.task_dict to access the dataset for retrieving documents by doc_id

Step 3: Media Handling

Implement multimodal input processing for the model's supported modalities. For chat models, use the ChatMessages protocol to extract images, videos, and audio from structured messages. Convert media into the format expected by the model's processor (PIL images, video tensors, audio waveforms). Handle mixed-modality inputs and variable numbers of media items per request.

Key considerations:

  • ChatMessages.extract_media() returns separate lists for images, videos, and audios
  • ChatMessages.to_hf_messages() converts to the HuggingFace message format for tokenizer chat templates
  • Video inputs may arrive as file paths requiring frame extraction
  • Audio inputs may require resampling to the model's expected sample rate

Step 4: Registry Registration

Register the model so lmms-eval can discover it via the --model CLI argument. Use the @register_model decorator on the class, then add a ModelManifest entry in lmms_eval/models/__init__.py. The manifest maps the model_id string to the class import path, specifying whether it supports chat_class_path, simple_class_path, or both.

Key considerations:

  • The model_id in @register_model must match the --model CLI argument
  • Add to AVAILABLE_CHAT_TEMPLATE_MODELS (chat) or AVAILABLE_SIMPLE_MODELS (simple) dictionaries
  • The ModelRegistryV2 also supports entry-point-based plugin registration for external packages
  • Aliases can be defined in the ModelManifest for alternative model names

Step 5: Example Script Creation

Create an example shell script in examples/models/ demonstrating how to run the model with typical evaluation tasks. The script should show representative --model_args, recommended --batch_size, and commonly used --tasks for the model's target modalities. Include comments explaining model-specific parameters.

Key considerations:

  • Show both minimal and full evaluation examples
  • Document any environment variables or prerequisites (API keys, model downloads)
  • Include recommended generation kwargs if they differ from defaults
  • Test the script end-to-end before submitting

Step 6: Testing and Validation

Validate the model integration by running evaluations on representative tasks. Test with a small --limit first, then verify results match expected baselines. Check that both generate_until and loglikelihood work correctly if the model supports both. Test with batch_size > 1 to verify batching logic, and in multi-GPU mode to verify distributed compatibility.

Key considerations:

  • Test with at least one generation task (e.g., mme) and one multiple-choice task (e.g., seedbench_ppl)
  • Use --log_samples to inspect model outputs for correctness
  • Compare results against published baselines for the model
  • Verify memory usage and throughput are reasonable

Execution Diagram

GitHub URL

Workflow Repository