
Implementation:OpenGVLab InternVL Evaluate Chat Model

From Leeroopedia


Knowledge Sources
Domains Evaluation, Distributed_Computing
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by the evaluation framework, for running distributed VQA evaluation through InternVL's chat interface.

Description

The evaluate_chat_model function in evaluate_vqa.py runs distributed inference across 17+ VQA benchmarks. For each benchmark, it:

  1. Loads the dataset into a VQADataset with the appropriate prompt template
  2. Distributes samples across ranks via InferenceSampler
  3. Calls model.chat() for each sample to generate predictions
  4. Gathers predictions and computes the benchmark-specific metric (vqa_score, anls, relaxed_accuracy, etc.)
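Step 2 above relies on InferenceSampler to give each rank a disjoint slice of the dataset. The real sampler lives in the InternVL eval code; the standalone function below is a hedged sketch of the contiguous-split idea only, and its exact remainder handling is an assumption.

```python
# Sketch of InferenceSampler-style sharding: split sample indices into
# contiguous, near-equal per-rank chunks so every sample is evaluated
# exactly once across the ranks launched by torchrun.
def shard_indices(total_size: int, world_size: int, rank: int) -> list[int]:
    shard_size = total_size // world_size
    left = total_size % world_size
    # Assumption: earlier ranks absorb the remainder, one extra sample each.
    begin = shard_size * rank + min(rank, left)
    end = begin + shard_size + (1 if rank < left else 0)
    return list(range(begin, end))

# Every sample is assigned to exactly one rank:
shards = [shard_indices(10, 3, r) for r in range(3)]
assert sorted(i for s in shards for i in s) == list(range(10))
```

Because each rank's slice is disjoint, predictions can later be concatenated without overlap after gathering.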

Similar functions exist in evaluate_mantis.py (multi-image), evaluate_mmhal.py (hallucination), and evaluate_mmiu.py (multi-image understanding).

Usage

Called automatically by the evaluation scripts when launched via torchrun or evaluate.sh.

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/eval/vqa/evaluate_vqa.py
  • Lines: L318-487

Signature

def evaluate_chat_model():
    """
    Main evaluation loop for VQA benchmarks.

    Reads configuration from command-line args:
        --checkpoint: Model path
        --datasets: Comma-separated benchmark names
        --dynamic: Enable dynamic image resolution
        --max-num: Maximum tile count for dynamic resolution
        --few-shot: Number of few-shot examples (default 0)

    Internally:
        1. Iterates over requested datasets from ds_collections
        2. Creates VQADataset with appropriate prompt
        3. Distributes via InferenceSampler + DataLoader
        4. Calls model.chat() per sample
        5. Gathers via all_gather_object
        6. Computes metric (vqa_score, anls, relaxed_accuracy, exact_match)
    """

Import

# Launched via torchrun:
torchrun --nproc_per_node=8 eval/vqa/evaluate_vqa.py \
    --checkpoint ./output/finetune \
    --datasets vqa-textvqa-val \
    --dynamic

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| --checkpoint | str | Yes | Path to model checkpoint |
| --datasets | str | Yes | Comma-separated benchmark names from ds_collections |
| --dynamic | flag | No | Enable dynamic image resolution |
| --max-num | int | No | Maximum tiles for dynamic resolution (default 6) |
| --few-shot | int | No | Number of few-shot examples (default 0) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| Results JSON | File | Per-sample predictions saved to the checkpoint directory |
| Metrics | stdout | Benchmark score (accuracy, ANLS, etc.) printed to the console |
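One of the metrics listed above, relaxed_accuracy, is commonly defined for chart-reading benchmarks as: a numeric prediction counts as correct when it is within 5% relative error of the target, with a fallback to exact string match for non-numeric answers. The sketch below follows that convention; the exact implementation in evaluate_vqa.py may differ in detail.

```python
# Hedged sketch of a relaxed-accuracy check (5% numeric tolerance,
# case-insensitive exact match otherwise).
def relaxed_correct(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gt = float(prediction), float(target)
        if gt == 0:
            return pred == gt
        return abs(pred - gt) / abs(gt) <= tolerance
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()

assert relaxed_correct("102", "100")       # 2% relative error: correct
assert not relaxed_correct("110", "100")   # 10% relative error: wrong
assert relaxed_correct("Paris", "paris")   # non-numeric: string fallback
```

The benchmark score is then the mean of this check over all gathered predictions.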

Usage Examples

Evaluate Multiple VQA Benchmarks

torchrun --nproc_per_node=8 eval/vqa/evaluate_vqa.py \
    --checkpoint ./output/finetune \
    --datasets vqa-textvqa-val,vqa-docvqa-val,vqa-chartqa-test \
    --dynamic \
    --max-num 12
