Implementation:OpenGVLab InternVL M4C Evaluator
| Knowledge Sources | |
|---|---|
| Domains | VQA Evaluation, Text Recognition, Benchmark Metrics |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Evaluation metric implementations for TextVQA, STVQA, and TextCaps benchmarks, providing answer normalization, soft accuracy scoring, ANLS (Average Normalized Levenshtein Similarity), and BLEU-4 evaluation, ported from Facebook's MMF framework.
Description
This module provides the evaluation infrastructure for text-centric visual question answering and captioning benchmarks:
- EvalAIAnswerProcessor: The core answer normalization pipeline, applied to both predicted and ground-truth answers. It runs word tokenization (lowercasing, comma and question-mark removal), punctuation processing (context-aware removal of 20+ punctuation characters, or their replacement with a space), and digit-article processing (number-word to digit conversion, article removal, and contraction normalization through a lookup table of 130+ entries that restores missing apostrophes, such as dont to don't).
- TextVQAAccuracyEvaluator: Computes soft accuracy using the standard VQA evaluation protocol, in which each prediction is scored against 10 human reference answers. For each unique reference answer, the score is min(1, matching_count / 3) averaged over the 10 leave-one-out subsets of the references, where matching_count is the number of remaining references equal to that answer. This follows the standard "at least 3 annotators agree" convention. A prediction that matches no reference scores 0.
- STVQAAccuracyEvaluator: Computes exact-match accuracy: a predicted answer scores 1.0 if it matches any ground-truth answer after normalization, and 0.0 otherwise.
- STVQAANLSEvaluator: Computes the Average Normalized Levenshtein Similarity (ANLS) metric from edit distance. Each sample takes its best similarity over the ground-truth answers, and similarities below the 0.5 threshold are scored as 0.0.
- TextCapsBleu4Evaluator: Computes the BLEU-4 score for caption evaluation using the pycocoevalcap toolkit with PTBTokenizer preprocessing.
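The soft-accuracy rule used for TextVQA can be sketched in a few lines. This is a minimal illustration rather than the module's own code: the function name soft_answer_scores is hypothetical, the reference answers are assumed to be already normalized, and the min(1, matching_count / 3) term is averaged over leave-one-out subsets of the references, as in the standard VQA protocol.

```python
from typing import Dict, List


def soft_answer_scores(gt_answers: List[str]) -> Dict[str, float]:
    """Return a soft score for each distinct reference answer.

    Assumption: `gt_answers` are already normalized strings.
    """
    scores: Dict[str, float] = {}
    for candidate in set(gt_answers):
        accs = []
        for held_out in range(len(gt_answers)):
            # Count how many of the *other* references equal the candidate.
            matches = sum(
                1
                for i, ans in enumerate(gt_answers)
                if i != held_out and ans == candidate
            )
            accs.append(min(1.0, matches / 3))
        # Average over all leave-one-out subsets.
        scores[candidate] = sum(accs) / len(accs)
    return scores


# Ten references: eight agree, two are one-off answers.
refs = ["cat"] * 8 + ["cats", "kitty"]
scores = soft_answer_scores(refs)
print(scores["cat"])              # 1.0
print(round(scores["kitty"], 4))  # 0.3
```

An answer given by only one annotator earns 1/3 in nine of the ten subsets and 0 in the subset where it is held out, which is why "kitty" lands at 0.3 rather than 0.3333.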
Usage
Use these evaluators in the LLaVA evaluation pipeline to compute standard benchmark metrics. Each evaluator's eval_pred_list method accepts a list of prediction dictionaries with "pred_answer" and "gt_answers" keys and returns an aggregate score.
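To make the eval_pred_list contract concrete, here is a hedged sketch of an ANLS-style evaluator that consumes the same list of dictionaries. The names levenshtein, anls, and eval_anls are hypothetical, the pure-Python edit distance stands in for whatever edit-distance routine the module relies on, and the handling of two empty strings is an assumption made so the sketch never divides by zero.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(pred: str, gt: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold."""
    pred, gt = pred.lower().strip(), gt.lower().strip()
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    similarity = 1 - levenshtein(pred, gt) / longest
    return similarity if similarity >= threshold else 0.0


def eval_anls(pred_list: List[Dict]) -> float:
    """Average, over samples, of the best similarity against any reference."""
    per_sample = [
        max(anls(entry["pred_answer"], gt) for gt in entry["gt_answers"])
        for entry in pred_list
    ]
    return sum(per_sample) / len(per_sample)


preds = [
    {"pred_answer": "coca cola", "gt_answers": ["coca-cola", "coke"]},
    {"pred_answer": "stop", "gt_answers": ["yield"]},
]
print(round(eval_anls(preds), 4))  # 0.4444
```

The first sample is one substitution away from a nine-character reference (similarity 8/9), while the second shares nothing with its reference and is cut to 0.0 by the threshold, so the aggregate is half of 8/9.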
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/m4c_evaluator.py
- Lines: 1-334
Signature
class EvalAIAnswerProcessor:
    def __call__(self, item: str) -> str: ...
    def word_tokenize(self, word: str) -> str: ...
    def process_punctuation(self, in_text: str) -> str: ...
    def process_digit_article(self, in_text: str) -> str: ...

class TextVQAAccuracyEvaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

class STVQAAccuracyEvaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

class STVQAANLSEvaluator:
    def get_anls(self, s1: str, s2: str) -> float: ...
    def eval_pred_list(self, pred_list: list) -> float: ...

class TextCapsBleu4Evaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...
Import
from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator, EvalAIAnswerProcessor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pred_list | List[dict] | Yes | List of prediction dicts with "pred_answer" (str) and "gt_answers" (List[str]) keys |
| item | str | Yes | Raw answer string to normalize (for EvalAIAnswerProcessor) |
Outputs
| Name | Type | Description |
|---|---|---|
| accuracy | float | Aggregate accuracy/ANLS/BLEU-4 score from eval_pred_list |
| normalized_answer | str | Normalized answer string from EvalAIAnswerProcessor |
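The string-in, string-out behaviour of the normalizer can be illustrated with a deliberately reduced sketch. Everything here is an assumption for illustration: normalize_answer is a hypothetical name, the three lookup tables are tiny stand-ins for the module's much larger ones, the contraction table is assumed to restore apostrophes (dont to don't), and the context-aware punctuation step is omitted entirely.

```python
import re

# Illustrative subsets only; the real lookup tables are far larger.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3"}
ARTICLES = {"a", "an", "the"}
APOSTROPHE_FIXES = {"dont": "don't", "cant": "can't"}


def normalize_answer(raw: str) -> str:
    """Simplified normalization: tokenize, map digits, drop articles."""
    # Word tokenization: lowercase, strip commas and question marks.
    text = raw.lower().replace(",", "").replace("?", "")
    text = re.sub(r"\s+", " ", text).strip()

    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)   # number word -> digit
        if word in ARTICLES:                  # drop a / an / the
            continue
        words.append(APOSTROPHE_FIXES.get(word, word))
    return " ".join(words)


print(normalize_answer("A Cat?"))              # cat
print(normalize_answer("Two dogs, the park"))  # 2 dogs park
print(normalize_answer("dont know"))           # don't know
```

Because both predictions and references pass through the same function, surface differences such as casing, articles, and spelled-out numbers do not count against a prediction.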
Usage Examples
Basic Usage
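from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()

# One dict per question: the model's answer plus the 10 human reference answers.
pred_list = [
    {
        "pred_answer": "a cat",
        "gt_answers": ["cat", "a cat", "cat", "cats", "cat",
                       "a cat", "cat", "cat", "kitty", "cat"]
    }
]

# Answers are normalized before comparison, so the article in "a cat" is dropped.
accuracy = evaluator.eval_pred_list(pred_list)
print(f"TextVQA Accuracy: {accuracy:.4f}")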