
Implementation:OpenGVLab InternVL Evaluate Chat Model

From Leeroopedia


Knowledge Sources
Domains Evaluation, Distributed_Computing
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by the evaluation framework, for running distributed VQA evaluation through InternVL's chat interface.

Description

The evaluate_chat_model function in evaluate_vqa.py runs distributed inference across 17+ VQA benchmarks. For each benchmark, it:

  1. Loads the dataset into a VQADataset with the appropriate prompt template
  2. Distributes samples across ranks via InferenceSampler
  3. Calls model.chat() for each sample to generate predictions
  4. Gathers predictions and computes the benchmark-specific metric (vqa_score, anls, relaxed_accuracy, etc.)
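Step 2 above relies on InferenceSampler to give each rank a disjoint slice of the dataset. The real sampler lives in the InternVL eval code; the standalone function below is a hedged sketch of the contiguous-split idea only, and its exact remainder handling is an assumption.

```python
# Sketch of InferenceSampler-style sharding: split sample indices into
# contiguous, near-equal per-rank chunks so every sample is evaluated
# exactly once across the ranks launched by torchrun.
def shard_indices(total_size: int, world_size: int, rank: int) -> list[int]:
    shard_size = total_size // world_size
    left = total_size % world_size
    # Assumption: earlier ranks absorb the remainder, one extra sample each.
    begin = shard_size * rank + min(rank, left)
    end = begin + shard_size + (1 if rank < left else 0)
    return list(range(begin, end))

# Every sample is assigned to exactly one rank:
shards = [shard_indices(10, 3, r) for r in range(3)]
assert sorted(i for s in shards for i in s) == list(range(10))
```

Because each rank's slice is disjoint, predictions can later be concatenated without overlap after gathering.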

Similar functions exist in evaluate_mantis.py (multi-image), evaluate_mmhal.py (hallucination), and evaluate_mmiu.py (multi-image understanding).

Usage

Called automatically by the evaluation scripts when launched via torchrun or evaluate.sh.

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/eval/vqa/evaluate_vqa.py
  • Lines: L318-487

Signature

def evaluate_chat_model():
    """
    Main evaluation loop for VQA benchmarks.

    Reads configuration from command-line args:
        --checkpoint: Model path
        --datasets: Comma-separated benchmark names
        --dynamic: Enable dynamic image resolution
        --max-num: Maximum tile count for dynamic resolution
        --few-shot: Number of few-shot examples (default 0)

    Internally:
        1. Iterates over requested datasets from ds_collections
        2. Creates VQADataset with appropriate prompt
        3. Distributes via InferenceSampler + DataLoader
        4. Calls model.chat() per sample
        5. Gathers via all_gather_object
        6. Computes metric (vqa_score, anls, relaxed_accuracy, exact_match)
    """

Import

# Launched via torchrun:
torchrun --nproc_per_node=8 eval/vqa/evaluate_vqa.py \
    --checkpoint ./output/finetune \
    --datasets vqa-textvqa-val \
    --dynamic

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| --checkpoint | str | Yes | Path to model checkpoint |
| --datasets | str | Yes | Comma-separated benchmark names from ds_collections |
| --dynamic | flag | No | Enable dynamic image resolution |
| --max-num | int | No | Maximum tiles for dynamic resolution (default 6) |
| --few-shot | int | No | Number of few-shot examples (default 0) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| Results JSON | File | Per-sample predictions saved to the checkpoint directory |
| Metrics | stdout | Benchmark score (accuracy, ANLS, etc.) printed to the console |
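One of the metrics listed above, relaxed_accuracy, is commonly defined for chart-reading benchmarks as: a numeric prediction counts as correct when it is within 5% relative error of the target, with a fallback to exact string match for non-numeric answers. The sketch below follows that convention; the exact implementation in evaluate_vqa.py may differ in detail.

```python
# Hedged sketch of a relaxed-accuracy check (5% numeric tolerance,
# case-insensitive exact match otherwise).
def relaxed_correct(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gt = float(prediction), float(target)
        if gt == 0:
            return pred == gt
        return abs(pred - gt) / abs(gt) <= tolerance
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()

assert relaxed_correct("102", "100")       # 2% relative error: correct
assert not relaxed_correct("110", "100")   # 10% relative error: wrong
assert relaxed_correct("Paris", "paris")   # non-numeric: string fallback
```

The benchmark score is then the mean of this check over all gathered predictions.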

Usage Examples

Evaluate Multiple VQA Benchmarks

torchrun --nproc_per_node=8 eval/vqa/evaluate_vqa.py \
    --checkpoint ./output/finetune \
    --datasets vqa-textvqa-val,vqa-docvqa-val,vqa-chartqa-test \
    --dynamic \
    --max-num 12
