Workflow:OpenGVLab InternVL Benchmark Evaluation

From Leeroopedia


Knowledge Sources
Domains VLMs, Evaluation, Benchmarking, Multimodal
Last Updated 2026-02-07 14:00 GMT

Overview

End-to-end process for evaluating InternVL models across multiple multimodal benchmarks including VQA, hallucination detection, and multi-image understanding.

Description

This workflow covers the evaluation of InternVL models on a comprehensive set of multimodal benchmarks. The evaluation suite includes Visual Question Answering (TextVQA, VQAv2, GQA, OKVQA, VizWiz, InfographicsVQA, DocVQA, ChartQA, AI2D), hallucination benchmarks (MMHal-Bench), multi-image understanding (Mantis-Eval, MMIU), and visual reasoning (MM-Vet). A central dispatcher script routes benchmark names to their corresponding evaluation modules, handling model loading, distributed inference, and result collection.

Usage

Execute this workflow after training or fine-tuning an InternVL model to measure its performance on standard benchmarks. This is essential for comparing against published baselines, validating that fine-tuning has not degraded general capabilities, and identifying specific strengths or weaknesses of the model.

Execution Steps

Step 1: Prepare Benchmark Datasets

Download and organize the evaluation datasets for each target benchmark. Each benchmark has specific data format requirements and evaluation protocols. The datasets should be placed in the expected directory structure referenced by the evaluation scripts.

Key considerations:

  • Each benchmark requires its own dataset download (images + questions)
  • VQA datasets typically provide question JSON files and image directories
  • Multi-image benchmarks (Mantis, MMIU) require specific multi-image data layouts
  • The eval README documents the expected data paths for each benchmark
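Before launching a run, it can help to confirm that each benchmark directory contains what the scripts expect. The sketch below is illustrative only: the file names in `EXPECTED_FILES` and the helper `missing_paths` are hypothetical placeholders, not the paths documented in the eval README, and should be replaced with the real layout.

```python
from pathlib import Path

# Hypothetical layout for illustration; substitute the paths listed in the
# eval README for each benchmark you intend to run.
EXPECTED_FILES = {
    "textvqa": ["questions.jsonl", "images"],
    "mantis": ["questions.jsonl", "images"],
}


def missing_paths(data_root, benchmark, expected=EXPECTED_FILES):
    """Return the expected entries that do not exist under data_root/benchmark."""
    base = Path(data_root) / benchmark
    return [name for name in expected[benchmark] if not (base / name).exists()]


if __name__ == "__main__":
    for name in EXPECTED_FILES:
        gaps = missing_paths("data", name)
        status = "ok" if not gaps else "missing: " + ", ".join(gaps)
        print(f"{name}: {status}")
```

Running a check like this first surfaces a missing download before any GPU time is spent on model loading.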

Step 2: Select Benchmarks

Choose which benchmarks to run from the supported suite. The central dispatcher script accepts benchmark names as arguments and routes each one to its evaluation module. Benchmarks can be run individually or as a batch.

Supported benchmark categories:

  • VQA: TextVQA, VQAv2, GQA, OKVQA, VizWiz, InfographicsVQA, DocVQA, ChartQA, AI2D
  • Hallucination: MMHal-Bench
  • Multi-image: Mantis-Eval, MMIU
  • Visual reasoning: MM-Vet
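The selection logic can be sketched as a small lookup that expands category names into benchmark names and rejects anything unknown. This is a minimal illustration, not the repository's dispatcher: the lowercase identifiers in `SUITE` and the function `resolve` are assumptions made for the example, and the real script may use different names.

```python
# Categories and benchmarks as listed above; the lowercase identifiers are
# assumed for this example and may differ from the names the script accepts.
SUITE = {
    "vqa": ["textvqa", "vqav2", "gqa", "okvqa", "vizwiz",
            "infographicsvqa", "docvqa", "chartqa", "ai2d"],
    "hallucination": ["mmhal-bench"],
    "multi-image": ["mantis-eval", "mmiu"],
    "visual-reasoning": ["mm-vet"],
}


def resolve(names):
    """Expand category names and validate benchmark names, preserving order."""
    known = {bench for benches in SUITE.values() for bench in benches}
    selected = []
    for name in names:
        key = name.lower()
        if key in SUITE:
            candidates = SUITE[key]
        elif key in known:
            candidates = [key]
        else:
            raise ValueError(f"unknown benchmark or category: {name!r}")
        for bench in candidates:
            if bench not in selected:  # skip duplicates from overlapping requests
                selected.append(bench)
    return selected


if __name__ == "__main__":
    print(resolve(["multi-image", "MM-Vet"]))
```

Failing fast on an unknown name avoids discovering a typo only after the other benchmarks in a batch have finished.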

Step 3: Load Model for Inference

Load the InternVL model checkpoint for evaluation. The model is loaded in evaluation mode with the same configuration used during training. When a model is too large for a single GPU, it is partitioned across the available GPUs using a device map.

Key considerations:

  • Model is loaded with torch_dtype=bfloat16 for efficient inference
  • Multi-GPU device mapping is used for large models (26B+)
  • The model loading utility in the model package handles device placement automatically
  • Flash Attention is enabled for efficient attention computation during inference
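To make the device-mapping idea concrete, the sketch below assigns decoder layers to GPUs while giving GPU 0 roughly half a share, on the assumption that it also hosts the vision encoder. The function `build_device_map` and the module names `vision_model` and `language_model.model.layers` are assumptions for illustration; the model package's own loading utility decides the actual placement.

```python
import math


def build_device_map(num_layers, num_gpus,
                     layer_prefix="language_model.model.layers",
                     vision_module="vision_model"):
    """Map decoder layers to GPU indices, treating GPU 0 as half a device.

    The module names are assumed for this example. A complete map would also
    place the embeddings, final norm, output head, and projector.
    """
    if num_layers < 1 or num_gpus < 1:
        raise ValueError("num_layers and num_gpus must be positive")
    per_gpu = math.ceil(num_layers / (num_gpus - 0.5))
    budget = [per_gpu] * num_gpus
    budget[0] = math.ceil(per_gpu * 0.5)  # leave room for the vision encoder
    device_map = {vision_module: 0}
    layer = 0
    for gpu, count in enumerate(budget):
        for _ in range(count):
            if layer == num_layers:  # budgets are rounded up, so stop at the last layer
                break
            device_map[f"{layer_prefix}.{layer}"] = gpu
            layer += 1
    return device_map


if __name__ == "__main__":
    mapping = build_device_map(num_layers=48, num_gpus=4)
    for gpu in range(4):
        count = sum(1 for k, v in mapping.items() if v == gpu and "layers" in k)
        print(f"GPU {gpu}: {count} decoder layers")
```

A dictionary of this shape is what Hugging Face `from_pretrained` accepts through its `device_map` argument, alongside `torch_dtype=torch.bfloat16`.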

Step 4: Run Distributed Inference

Execute evaluation inference across GPUs. Each evaluation script handles question formatting, image preprocessing with dynamic resolution, generation with configurable decoding parameters, and output collection. Results are gathered from all processes for scoring.
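The shard-then-gather pattern described above can be sketched without any GPU code. In this minimal example each rank takes a strided slice of the questions, and the per-rank outputs are merged back into one ordered list and serialized as JSONL. The helpers `shard`, `merge_by_id`, and `to_jsonl` and the `question_id` field are assumptions for illustration; in a real run the gathering step would typically go through a collective such as `torch.distributed.all_gather_object`.

```python
import json


def shard(items, rank, world_size):
    """Strided shard: rank r processes items r, r + W, r + 2W, ..."""
    return items[rank::world_size]


def merge_by_id(per_rank_results, key="question_id"):
    """Flatten per-rank result lists and restore a stable order by id."""
    merged = [rec for results in per_rank_results for rec in results]
    return sorted(merged, key=lambda rec: rec[key])


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "".join(json.dumps(rec, ensure_ascii=False) + "\n" for rec in records)


if __name__ == "__main__":
    questions = [{"question_id": i, "question": f"q{i}"} for i in range(7)]
    world_size = 3
    per_rank = []
    for rank in range(world_size):
        # Placeholder for real generation: echo the question as the answer.
        per_rank.append([
            {"question_id": q["question_id"], "answer": q["question"]}
            for q in shard(questions, rank, world_size)
        ])
    print(to_jsonl(merge_by_id(per_rank)), end="")
```

Sorting by id after the merge keeps the output file deterministic regardless of how many GPUs were used.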

Key considerations:

  • Most benchmarks use distributed inference across multiple GPUs for speed
  • Dynamic image resolution (1 to 12 patches of 448x448 pixels each) is applied to evaluation images
  • Generation parameters (temperature, top_p, max_new_tokens) vary by benchmark
  • The conversation template must match the LLM backbone used during training
  • Results are saved as JSONL files for subsequent scoring
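The dynamic-resolution step can be illustrated with a simplified grid search: among all grids of 1 to 12 patches, pick the one whose aspect ratio is closest to the image's, then cut the resized image into 448x448 crops. This is a sketch under stated assumptions, not the repository's preprocessing code: `choose_grid` and `patch_boxes` are hypothetical names, and ties are broken here in favor of fewer patches, which may differ from the actual rule.

```python
PATCH = 448  # side length of each square patch, in pixels


def choose_grid(width, height, min_patches=1, max_patches=12):
    """Return the (cols, rows) grid whose aspect ratio best matches the image."""
    aspect = width / height
    grids = [
        (cols, rows)
        for cols in range(1, max_patches + 1)
        for rows in range(1, max_patches + 1)
        if min_patches <= cols * rows <= max_patches
    ]
    # Visit smaller grids first so that ties resolve to fewer patches.
    grids.sort(key=lambda g: (g[0] * g[1], g[0], g[1]))
    best, best_diff = grids[0], float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
    return best


def patch_boxes(cols, rows, patch=PATCH):
    """Crop boxes (left, top, right, bottom) for an image resized to the grid."""
    return [
        (c * patch, r * patch, (c + 1) * patch, (r + 1) * patch)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(1920, 1080)
    print(f"grid: {cols} x {rows}, patches: {len(patch_boxes(cols, rows))}")
```

The boxes assume the image has already been resized to `cols * 448` by `rows * 448` pixels, so every crop is exactly one patch.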

Step 5: Score and Report Results

Apply benchmark-specific scoring metrics to the generated outputs. Each benchmark has its own evaluation protocol: VQA accuracy for most VQA tasks, ANLS for InfographicsVQA and DocVQA, and GPT-based scoring for MM-Vet. Results are reported in the standard format for each benchmark.

Key considerations:

  • Most VQA benchmarks use standard VQA accuracy (relaxed matching with string normalization)
  • MM-Vet requires GPT-based evaluation for open-ended responses
  • InfographicsVQA and DocVQA use ANLS (Average Normalized Levenshtein Similarity)
  • Some benchmarks require converting outputs to a submission format for external evaluation servers
  • Compare results against published baselines for the corresponding model size
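The two string-based metrics named above can be sketched in a few lines. The example uses the common simplified form of VQA accuracy, min(1, matches / 3), and ANLS with the usual 0.5 threshold on normalized edit distance. The `normalize` helper is a deliberately minimal assumption (lowercasing and whitespace collapsing only); official scorers apply more extensive normalization, so scores from this sketch are not directly comparable to published numbers.

```python
def normalize(text):
    """Minimal normalization (assumed): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def levenshtein(a, b):
    """Edit distance between two strings using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls_score(prediction, answers, threshold=0.5):
    """Per-question ANLS: best 1 - NL over the answers, zero once NL reaches the threshold."""
    pred = normalize(prediction)
    best = 0.0
    for answer in answers:
        truth = normalize(answer)
        longest = max(len(pred), len(truth))
        nl = levenshtein(pred, truth) / longest if longest else 0.0
        if nl < threshold:
            best = max(best, 1.0 - nl)
    return best


def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: full credit when at least 3 annotators agree."""
    pred = normalize(prediction)
    matches = sum(normalize(answer) == pred for answer in human_answers)
    return min(1.0, matches / 3)


if __name__ == "__main__":
    print(anls_score("Invoice 2021", ["invoice 2021", "invoice"]))
    print(vqa_accuracy("cat", ["cat", "cat", "dog", "kitten"]))
```

The dataset-level score for either metric is the mean of the per-question values.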
