Implementation:OpenGVLab InternVL Evaluate Chat Model
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool in InternVL's evaluation framework for running distributed VQA evaluation through the model's chat interface.
Description
The evaluate_chat_model function in evaluate_vqa.py runs distributed inference across 17+ VQA benchmarks. For each benchmark, it:
- Loads the dataset into a VQADataset with the appropriate prompt template
- Distributes samples across ranks via InferenceSampler
- Calls model.chat() for each sample to generate predictions
- Gathers predictions and computes the benchmark-specific metric (vqa_score, anls, relaxed_accuracy, etc.)
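The four steps above can be sketched in plain Python. Everything here (`run_benchmark`, the mock chat and metric callables) is illustrative scaffolding, not the framework's actual API; the real loop uses a torch `DataLoader`, `InferenceSampler`, and `model.chat()`.

```python
# Illustrative sketch of the per-benchmark loop (not the repository's code).
def run_benchmark(samples, chat_fn, metric_fn, rank, world_size):
    # 1. Partition samples contiguously across ranks (InferenceSampler-style).
    per_rank = (len(samples) + world_size - 1) // world_size
    shard = samples[rank * per_rank:(rank + 1) * per_rank]
    # 2. Generate one prediction per sample via the chat interface.
    preds = [{"question_id": s["id"], "answer": chat_fn(s["question"])}
             for s in shard]
    # 3. The real loop gathers predictions from all ranks with
    #    torch.distributed.all_gather_object; a single rank suffices here.
    # 4. Score the gathered predictions with the benchmark-specific metric.
    return metric_fn(preds)
```

In the real script each rank runs this loop over its own shard, and only rank 0 writes the merged results JSON.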
Similar functions exist in evaluate_mantis.py (multi-image), evaluate_mmhal.py (hallucination), and evaluate_mmiu.py (multi-image understanding).
Usage
Called automatically by the evaluation scripts when launched via torchrun or evaluate.sh.
Code Reference
Source Location
- Repository: InternVL
- File: internvl_chat/eval/vqa/evaluate_vqa.py
- Lines: L318-487
Signature
def evaluate_chat_model():
    """
    Main evaluation loop for VQA benchmarks.

    Reads configuration from command-line args:
        --checkpoint: Model path
        --datasets: Comma-separated benchmark names
        --dynamic: Enable dynamic image resolution
        --max-num: Maximum tile count for dynamic resolution
        --few-shot: Number of few-shot examples (default 0)

    Internally:
        1. Iterates over requested datasets from ds_collections
        2. Creates VQADataset with the appropriate prompt
        3. Distributes via InferenceSampler + DataLoader
        4. Calls model.chat() per sample
        5. Gathers via all_gather_object
        6. Computes metric (vqa_score, anls, relaxed_accuracy, exact_match)
    """
Import
# Launched via torchrun:
torchrun --nproc_per_node=8 eval/vqa/evaluate_vqa.py \
--checkpoint ./output/finetune \
--datasets vqa-textvqa-val \
--dynamic
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --checkpoint | str | Yes | Path to model checkpoint |
| --datasets | str | Yes | Comma-separated benchmark names from ds_collections |
| --dynamic | flag | No | Enable dynamic resolution |
| --max-num | int | No | Maximum tiles for dynamic resolution (default 6) |
| --few-shot | int | No | Number of few-shot examples (default 0) |
Outputs
| Name | Type | Description |
|---|---|---|
| Results JSON | File | Per-sample predictions saved to checkpoint directory |
| Metrics | stdout | Benchmark score (accuracy, ANLS, etc.) printed to console |
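The ANLS score reported for DocVQA-style benchmarks is, in general, one minus the normalized Levenshtein distance against the closest reference answer, zeroed below the standard 0.5 threshold. A pure-Python sketch (illustrative, not the repository's implementation):

```python
def _levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(pred, golds, threshold=0.5):
    """Normalized Levenshtein similarity of `pred` against the closest
    reference in `golds` (illustrative sketch of the ANLS metric)."""
    p = pred.strip().lower()
    best = 0.0
    for gold in golds:
        g = gold.strip().lower()
        d = _levenshtein(p, g)
        best = max(best, 1.0 - d / max(len(p), len(g), 1))
    return best if best >= threshold else 0.0
```

The threshold prevents partial credit for answers that share only a few characters with the reference.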
Usage Examples
Evaluate Multiple VQA Benchmarks
torchrun --nproc_per_node=8 eval/vqa/evaluate_vqa.py \
--checkpoint ./output/finetune \
--datasets vqa-textvqa-val,vqa-docvqa-val,vqa-chartqa-test \
--dynamic \
--max-num 12
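For ChartQA, the reported score is typically relaxed accuracy: numeric answers count as correct within 5% relative error, while string answers must match exactly. A hypothetical sketch of that rule (not the repository's exact code):

```python
def relaxed_accuracy(pred, gold, tolerance=0.05):
    """ChartQA-style relaxed match (illustrative sketch): numeric answers
    may deviate by up to 5% relative error; otherwise exact string match."""
    try:
        p = float(str(pred).rstrip('%'))
        g = float(str(gold).rstrip('%'))
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tolerance
    except ValueError:
        # Non-numeric answers fall back to a case-insensitive exact match.
        return str(pred).strip().lower() == str(gold).strip().lower()
```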
Related Pages
Implements Principle
Requires Environment