Workflow:OpenGVLab InternVL Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | VLMs, Evaluation, Benchmarking, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
End-to-end process for evaluating InternVL models across multiple multimodal benchmarks including VQA, hallucination detection, and multi-image understanding.
Description
This workflow covers the evaluation of InternVL models on a comprehensive set of multimodal benchmarks. The evaluation suite includes Visual Question Answering (TextVQA, VQAv2, GQA, OKVQA, VizWiz, InfographicsVQA, DocVQA, ChartQA, AI2D), hallucination benchmarks (MMHal-Bench), multi-image understanding (Mantis-Eval, MMIU), and visual reasoning (MM-Vet). A central dispatcher script routes benchmark names to their corresponding evaluation modules, handling model loading, distributed inference, and result collection.
Usage
Execute this workflow after training or fine-tuning an InternVL model to measure its performance on standard benchmarks. This is essential for comparing against published baselines, validating that fine-tuning has not degraded general capabilities, and identifying specific strengths or weaknesses of the model.
Execution Steps
Step 1: Prepare Benchmark Datasets
Download and organize the evaluation datasets for each target benchmark. Each benchmark has specific data format requirements and evaluation protocols. The datasets should be placed in the expected directory structure referenced by the evaluation scripts.
Key considerations:
- Each benchmark requires its own dataset download (images + questions)
- VQA datasets typically provide question JSON files and image directories
- Multi-image benchmarks (Mantis, MMIU) require specific multi-image data layouts
- The eval README documents the expected data paths for each benchmark
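A quick pre-flight check can catch missing downloads before any GPU time is spent. The sketch below is illustrative only: the `EXPECTED_FILES` entries and the `find_missing` helper are hypothetical names, and the real paths should be taken from the eval README.

```python
from pathlib import Path

# Hypothetical layout for illustration; the real paths are listed in the eval README.
EXPECTED_FILES = {
    "textvqa": ["textvqa/questions.jsonl", "textvqa/images"],
    "docvqa": ["docvqa/val.jsonl", "docvqa/images"],
}


def find_missing(data_root, benchmarks, expected=EXPECTED_FILES):
    """Return {benchmark: [missing relative paths]} for benchmarks with gaps."""
    root = Path(data_root)
    missing = {}
    for name in benchmarks:
        if name not in expected:
            raise KeyError(f"no layout registered for benchmark {name!r}")
        # Keep only the entries that do not exist under the data root.
        absent = [rel for rel in expected[name] if not (root / rel).exists()]
        if absent:
            missing[name] = absent
    return missing
```

An empty result means every registered file and directory was found, so the benchmark can proceed to inference.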
Step 2: Select Benchmarks
Choose which benchmarks to run from the supported suite. The central dispatcher script accepts benchmark names as arguments and routes each one to the appropriate evaluation module. Benchmarks can be run individually or as a batch.
Supported benchmark categories:
- VQA: TextVQA, VQAv2, GQA, OKVQA, VizWiz, InfographicsVQA, DocVQA, ChartQA, AI2D
- Hallucination: MMHal-Bench
- Multi-image: Mantis-Eval, MMIU
- Visual reasoning: MM-Vet
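The routing idea can be sketched as a lookup table from benchmark name to evaluation module. This is not the repository's dispatcher: the short names and module labels in `BENCHMARK_MODULES` are placeholders chosen for illustration, and `route` is a hypothetical helper.

```python
# Hypothetical registry: benchmark name -> evaluation module label.
# The names below are placeholders, not the repository's actual identifiers.
BENCHMARK_MODULES = {
    "textvqa": "vqa", "vqav2": "vqa", "gqa": "vqa", "okvqa": "vqa",
    "vizwiz": "vqa", "infovqa": "vqa", "docvqa": "vqa", "chartqa": "vqa",
    "ai2d": "vqa",
    "mmhal": "mmhal",
    "mantis": "mantis_eval",
    "mmiu": "mmiu",
    "mmvet": "mmvet",
}


def route(benchmarks):
    """Group requested benchmarks by the module that evaluates them."""
    grouped = {}
    unknown = []
    for name in benchmarks:
        key = name.lower()
        module = BENCHMARK_MODULES.get(key)
        if module is None:
            unknown.append(name)
        else:
            grouped.setdefault(module, []).append(key)
    if unknown:
        # Fail before any model is loaded rather than midway through a batch.
        raise ValueError(f"unsupported benchmarks: {unknown}")
    return grouped
```

Grouping by module means a batch such as TextVQA plus DocVQA can share one model load, while an unknown name fails fast.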
Step 3: Load Model for Inference
Load the InternVL model checkpoint for evaluation. The model is loaded in evaluation mode with the same configuration used during training. When a model is too large to fit on a single GPU, it is partitioned across the available GPUs using device mapping.
Key considerations:
- Model is loaded with torch_dtype=bfloat16 for efficient inference
- Multi-GPU device mapping is used for large models (26B+)
- The model loading utility in the model package handles device placement automatically
- Flash Attention is enabled for efficient attention computation during inference
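For large checkpoints, device mapping amounts to assigning each decoder layer to a GPU index. The sketch below shows one plausible policy, assuming GPU 0 should take roughly half a share because it also hosts the vision encoder. The `build_device_map` helper and the `language_model.model.layers` key prefix are assumptions for illustration, not the model package's actual utility.

```python
import math


def build_device_map(num_layers, num_gpus, prefix="language_model.model.layers"):
    """Map decoder layer names to GPU indices, giving GPU 0 a half share.

    The key prefix is an assumption; check the checkpoint's module names.
    A complete map also needs entries for the non-layer modules (vision
    encoder, embeddings, final norm, output head), omitted here because
    their names depend on the checkpoint.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return {f"{prefix}.{i}": 0 for i in range(num_layers)}

    # Treat GPU 0 as half a device when computing the per-GPU layer budget.
    per_gpu = math.ceil(num_layers / (num_gpus - 0.5))
    counts = [math.ceil(per_gpu * 0.5)] + [per_gpu] * (num_gpus - 1)

    device_map = {}
    layer = 0
    for gpu, count in enumerate(counts):
        for _ in range(count):
            if layer >= num_layers:
                break
            device_map[f"{prefix}.{layer}"] = gpu
            layer += 1
    return device_map
```

With 48 layers on 4 GPUs this places 7 layers on GPU 0 and spreads the remaining 41 over GPUs 1 to 3. A dictionary of this shape is what the `device_map` argument of a Hugging Face `from_pretrained` call accepts.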
Step 4: Run Distributed Inference
Execute evaluation inference across GPUs. Each evaluation script handles question formatting, image preprocessing with dynamic resolution, generation with configurable decoding parameters, and output collection. Results are gathered from all processes for scoring.
Key considerations:
- Most benchmarks use distributed inference across multiple GPUs for speed
- Dynamic image resolution (1-12 patches of 448x448) is applied to evaluation images
- Generation parameters (temperature, top_p, max_new_tokens) vary by benchmark
- The conversation template must match the LLM backbone used during training
- Results are saved as JSONL files for subsequent scoring
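Dynamic resolution can be understood as choosing a grid of 448x448 tiles whose aspect ratio best matches the input image, subject to the 1-12 tile budget. The following is a minimal sketch of that selection step, assuming a closest-aspect-ratio rule with a tie-break that prefers larger grids for larger images. The function names are illustrative, and the actual preprocessing may differ (for example, by adding a thumbnail tile).

```python
def candidate_grids(min_tiles=1, max_tiles=12):
    """All (cols, rows) grids whose tile count lies within the budget."""
    grids = {
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    }
    # Deterministic order: fewest tiles first.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0], g[1]))


def choose_grid(width, height, tile=448, min_tiles=1, max_tiles=12):
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_tiles, max_tiles):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * tile * tile * cols * rows:
            # Same ratio: take the larger grid only if the image has enough pixels.
            best = (cols, rows)
    return best
```

The image would then be resized to `cols * tile` by `rows * tile` pixels and cut into `cols * rows` tiles. A 448x448 image stays a single tile, while a 1344x1344 image maps to a 3x3 grid.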
Step 5: Score and Report Results
Apply benchmark-specific scoring metrics to the generated outputs. Each benchmark has its own evaluation protocol: VQA accuracy for most VQA tasks, ANLS for InfographicsVQA and DocVQA, GPT-based scoring for MM-Vet, and specialized metrics for the others. Results are reported in the standard format for each benchmark.
Key considerations:
- VQA benchmarks other than InfographicsVQA and DocVQA use standard VQA accuracy (relaxed matching with string normalization)
- MM-Vet requires GPT-based evaluation for open-ended responses
- InfographicsVQA and DocVQA use ANLS (Average Normalized Levenshtein Similarity)
- Some benchmarks require converting outputs to a submission format for external evaluation servers
- Compare results against published baselines for the corresponding model size
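The two string-based metrics named above can be sketched in a few lines. This is a simplified illustration, not the official scorers: `vqa_accuracy` uses the common `min(1, matches / 3)` form, `anls` uses the usual 0.5 threshold on normalized edit distance, and `normalize` is a deliberately minimal stand-in for the fuller answer normalization that the official scripts apply.

```python
def normalize(text):
    """Minimal normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().strip().split())


def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: full credit once 3 annotators agree."""
    pred = normalize(prediction)
    matches = sum(normalize(a) == pred for a in human_answers)
    return min(1.0, matches / 3.0)


def levenshtein(a, b):
    """Edit distance via dynamic programming over a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions."""
    scores = []
    for pred, answers in zip(predictions, references):
        p = normalize(pred)
        best = 0.0
        for ans in answers:
            a = normalize(ans)
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Answers further than the threshold from the reference score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a prediction of "hallo" against the reference "hello" has a normalized distance of 0.2 and therefore contributes 0.8 to ANLS, whereas a prediction matching only one of the human answers earns one third under the simplified VQA accuracy.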