Implementation:OpenGVLab InternVL M4C Evaluator

Knowledge Sources	OpenGVLab_InternVL
Domains	VQA Evaluation, Text Recognition, Benchmark Metrics
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements the evaluation metrics for the TextVQA, STVQA, and TextCaps benchmarks: answer normalization, soft accuracy scoring, ANLS (Average Normalized Levenshtein Similarity), and BLEU-4. The code is ported from Facebook's MMF framework.

Description

This module provides the evaluation infrastructure for text-centric visual question answering and captioning benchmarks:

EvalAIAnswerProcessor: The core answer normalization pipeline, applied to both predicted and ground-truth answers. It performs word tokenization (lowercasing and removing commas and question marks), punctuation processing (context-aware removal or spacing of 20+ punctuation characters), digit-article processing (converting number words to digits and removing articles), and contraction normalization (mapping 130+ apostrophe-less contraction spellings such as "dont" to their apostrophized forms).
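
The normalization steps can be sketched as follows. This is a simplified illustration, not the module's code: the lookup tables are heavily abbreviated, the function name normalize_answer is hypothetical, and special cases such as numeric commas and periods are left out.

```python
# Abbreviated, illustrative tables. The real module is described as covering
# 20+ punctuation characters and 130+ contractions; these entries are assumed.
NUMBER_MAP = {"zero": "0", "one": "1", "two": "2", "three": "3"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "cant": "can't", "isnt": "isn't"}
PUNCTUATION = [";", "/", "[", "]", '"', "{", "}", "(", ")", "!", ":", "-"]


def word_tokenize(text: str) -> str:
    """Lowercase the text and drop commas and question marks."""
    return text.lower().replace(",", "").replace("?", "").strip()


def process_punctuation(text: str) -> str:
    """Delete a mark that touches a space; otherwise replace it with a space."""
    out = text
    for p in PUNCTUATION:
        if p + " " in text or " " + p in text:
            out = out.replace(p, "")
        else:
            out = out.replace(p, " ")
    return out


def process_digit_article(text: str) -> str:
    """Map number words to digits, drop articles, normalize contractions."""
    words = []
    for word in text.split():
        word = NUMBER_MAP.get(word, word)
        if word not in ARTICLES:
            words.append(CONTRACTIONS.get(word, word))
    return " ".join(words)


def normalize_answer(item: str) -> str:
    """Run the steps in sequence (hypothetical stand-in for __call__)."""
    item = word_tokenize(item)
    item = item.replace("\n", " ").replace("\t", " ").strip()
    item = process_punctuation(item)
    return process_digit_article(item)


print(normalize_answer("A Cat?"))     # cat
print(normalize_answer("Two dogs!"))  # 2 dogs
```

Because the same function is applied to predictions and references, surface differences such as "A Cat?" versus "cat" no longer affect the score.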

TextVQAAccuracyEvaluator: Computes soft accuracy using the standard VQA evaluation protocol where each answer is scored against 10 human reference answers. For each unique answer, the accuracy is min(1, matching_count / 3), following the standard "at least 3 annotators agree" convention.
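
A minimal sketch of the soft accuracy rule, under two assumptions that the description above does not spell out: the reference answers are already normalized, and the min(1, matching_count / 3) term is averaged over the ten leave-one-out subsets of references, as the VQA protocol is commonly applied. The function name soft_accuracy_scores is hypothetical.

```python
def soft_accuracy_scores(gt_answers):
    """Score every unique reference answer against the other references.

    Assumes the answers are already normalized. Each reference is held out
    in turn, and min(1, matching_count / 3) is averaged over the hold-outs.
    """
    indexed = list(enumerate(gt_answers))
    scores = {}
    for candidate in set(gt_answers):
        accs = []
        for held_out in indexed:
            others = [item for item in indexed if item != held_out]
            matching_count = sum(1 for _, ans in others if ans == candidate)
            accs.append(min(1.0, matching_count / 3))
        scores[candidate] = sum(accs) / len(accs)
    return scores


refs = ["cat"] * 8 + ["cats", "kitty"]
scores = soft_accuracy_scores(refs)
print(scores["cat"])   # full credit: at least 3 other annotators always agree
print(scores["cats"])  # partial credit, about 0.3: a single annotator gave it
```

A prediction that matches none of the references would receive 0.0 under this scheme.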

STVQAAccuracyEvaluator: Computes exact match accuracy where a predicted answer scores 1.0 if it matches any ground-truth answer after normalization.
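
The exact-match rule can be illustrated with a short sketch. The normalize function here is a hypothetical stand-in that only lowercases and trims; the evaluator itself is described as matching after full answer normalization.

```python
def normalize(answer: str) -> str:
    # Hypothetical stand-in for the answer processor: lowercase and trim only.
    return answer.lower().strip()


def exact_match_accuracy(pred_list):
    """Mean of per-entry scores: 1.0 if the prediction equals any reference."""
    scores = []
    for entry in pred_list:
        pred = normalize(entry["pred_answer"])
        gts = [normalize(a) for a in entry["gt_answers"]]
        scores.append(1.0 if pred in gts else 0.0)
    return sum(scores) / len(scores)


preds = [
    {"pred_answer": " STOP ", "gt_answers": ["stop", "stop sign"]},
    {"pred_answer": "exit", "gt_answers": ["stop"]},
]
print(exact_match_accuracy(preds))  # 0.5
```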

STVQAANLSEvaluator: Computes the Average Normalized Levenshtein Similarity (ANLS) metric using edit distance, with a threshold of 0.5 (answers below this similarity are scored as 0.0).
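
A self-contained sketch of the per-pair ANLS score. It uses a pure-Python Levenshtein distance in place of whatever edit-distance routine the evaluator relies on; lowercasing and trimming the inputs, and the handling of two empty strings, are assumptions made for this illustration.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(pred: str, gt: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed out below the threshold."""
    s1, s2 = pred.lower().strip(), gt.lower().strip()
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    similarity = 1 - levenshtein(s1, s2) / longest
    return similarity if similarity >= threshold else 0.0


print(anls("Coca Cola", "coca-cola"))  # one edit over nine characters, about 0.889
print(anls("abc", "xyz"))              # 0.0, similarity falls below the threshold
```

For a prediction with several references, the best score over the references would be taken before averaging over the dataset.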

TextCapsBleu4Evaluator: Computes BLEU-4 score for caption evaluation using the pycocoevalcap toolkit with PTBTokenizer preprocessing.
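
The pycocoevalcap tokenizer and scorers work on dictionaries keyed by sample id rather than on a flat prediction list. The sketch below shows only that reshaping step, assuming the usual pycocoevalcap caption format (a list of {"caption": ...} dicts per id); the function name to_caption_dicts is hypothetical.

```python
def to_caption_dicts(pred_list):
    """Reshape prediction entries into id-keyed references and hypotheses."""
    gts, res = {}, {}
    for idx, entry in enumerate(pred_list):
        # All reference captions for this sample.
        gts[idx] = [{"caption": ref} for ref in entry["gt_answers"]]
        # Exactly one hypothesis caption per sample.
        res[idx] = [{"caption": entry["pred_answer"]}]
    return gts, res


captions = [
    {"pred_answer": "a red stop sign",
     "gt_answers": ["a stop sign", "red sign that says stop"]},
]
gts, res = to_caption_dicts(captions)
print(gts[0])
print(res[0])
```

With pycocoevalcap installed (its PTBTokenizer needs a Java runtime), both dictionaries would then be tokenized and passed to a 4-gram BLEU scorer, with the fourth returned score taken as BLEU-4.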

Usage

Use these evaluators in the LLaVA evaluation pipeline to compute standard benchmark metrics. Each evaluator's eval_pred_list method accepts a list of prediction dictionaries with "pred_answer" and "gt_answers" keys and returns an aggregate score.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/eval/m4c_evaluator.py
Lines: 1-334

Signature

class EvalAIAnswerProcessor:
    def __call__(self, item: str) -> str: ...
    def word_tokenize(self, word: str) -> str: ...
    def process_punctuation(self, in_text: str) -> str: ...
    def process_digit_article(self, in_text: str) -> str: ...

class TextVQAAccuracyEvaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

class STVQAAccuracyEvaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

class STVQAANLSEvaluator:
    def get_anls(self, s1: str, s2: str) -> float: ...
    def eval_pred_list(self, pred_list: list) -> float: ...

class TextCapsBleu4Evaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

Import

from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator, EvalAIAnswerProcessor

I/O Contract

Inputs

Name	Type	Required	Description
pred_list	List[dict]	Yes	List of prediction dicts with "pred_answer" (str) and "gt_answers" (List[str]) keys
item	str	Yes	Raw answer string to normalize (for EvalAIAnswerProcessor)

Outputs

Name	Type	Description
accuracy	float	Scalar returned by eval_pred_list: accuracy, ANLS, or BLEU-4, depending on the evaluator
normalized_answer	str	Normalized answer string from EvalAIAnswerProcessor

Usage Examples

Basic Usage

from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()

# One entry per question: the model's answer plus the 10 human reference answers.
pred_list = [
    {
        "pred_answer": "a cat",
        "gt_answers": ["cat", "a cat", "cat", "cats", "cat",
                       "a cat", "cat", "cat", "kitty", "cat"]
    }
]

# Answers are normalized before scoring; the result is the mean soft accuracy.
accuracy = evaluator.eval_pred_list(pred_list)
print(f"TextVQA Accuracy: {accuracy:.4f}")
