Implementation:OpenGVLab InternVL M4C Evaluator

From Leeroopedia


Knowledge Sources
Domains: VQA Evaluation, Text Recognition, Benchmark Metrics
Last Updated: 2026-02-07 14:00 GMT

Overview

This module implements the evaluation metrics for the TextVQA, STVQA, and TextCaps benchmarks: answer normalization, soft accuracy scoring, ANLS (Average Normalized Levenshtein Similarity), and BLEU-4. It is ported from Facebook's MMF framework.

Description

This module provides the evaluation infrastructure for text-centric visual question answering and captioning benchmarks:

EvalAIAnswerProcessor: The core answer normalization pipeline, applied to both predicted and ground-truth answers. It runs word tokenization (lowercasing, comma and question mark removal), punctuation processing (context-aware removal or spacing of 20+ punctuation characters), digit-article processing (number-word to digit conversion, article removal), and contraction normalization (mapping 130+ apostrophe-less contractions such as "dont" to their standard forms such as "don't").

TextVQAAccuracyEvaluator: Computes soft accuracy using the standard VQA evaluation protocol, in which each prediction is scored against 10 human reference answers. A prediction that matches matching_count reference answers scores min(1, matching_count / 3), so agreement with at least 3 annotators earns full credit.

STVQAAccuracyEvaluator: Computes exact match accuracy where a predicted answer scores 1.0 if it matches any ground-truth answer after normalization.

STVQAANLSEvaluator: Computes the Average Normalized Levenshtein Similarity (ANLS) metric using edit distance, with a threshold of 0.5 (answers below this similarity are scored as 0.0).

TextCapsBleu4Evaluator: Computes BLEU-4 score for caption evaluation using the pycocoevalcap toolkit with PTBTokenizer preprocessing.
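The normalization stages listed for EvalAIAnswerProcessor can be sketched roughly as follows. This is a simplified illustration, not the module's code: the function name normalize_answer is hypothetical, the lookup tables are tiny stand-ins for the much larger ones described above, punctuation is always replaced by a space rather than handled context-sensitively, and the contraction table is assumed to restore missing apostrophes.

```python
# Tiny illustrative tables (assumptions); the real processor uses far larger ones.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "cant": "can't", "isnt": "isn't"}
PUNCTUATION = ";/[]\"{}()=+\\_-<>@`!*#%^&$~|:"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize_answer(text: str) -> str:
    """Simplified answer normalization (hypothetical helper)."""
    # Stage 1: word tokenization -> lowercase, drop commas and question marks.
    text = text.lower().replace(",", "").replace("?", "").strip()
    # Stage 2: punctuation processing -> simplified here to "replace with a space".
    text = text.translate(_PUNCT_TABLE)
    # Stage 3: digit-article processing and contraction normalization.
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)  # number word -> digit
        if word in ARTICLES:                 # drop articles
            continue
        words.append(CONTRACTIONS.get(word, word))
    return " ".join(words)


print(normalize_answer("A Cat?"))
print(normalize_answer("Two dogs, dont run!"))
```

Because both the prediction and every reference answer pass through the same function, surface differences such as "A Cat?" versus "cat" no longer affect the match.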

Usage

Use these evaluators in the LLaVA evaluation pipeline to compute standard benchmark metrics. Each evaluator's eval_pred_list method accepts a list of prediction dictionaries with "pred_answer" and "gt_answers" keys and returns an aggregate score.
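As a rough illustration of what an eval_pred_list-style aggregate might compute for TextVQA, the sketch below scores each prediction with min(1, matching_count / 3) and averages over the list. The function names soft_scores and soft_accuracy are hypothetical, answers are assumed to be already normalized, and the leave-one-out averaging over the reference answers follows the common VQA formulation; treat it as an assumption about this module rather than a description of it.

```python
from typing import Dict, List


def soft_scores(gt_answers: List[str]) -> Dict[str, float]:
    """Soft score for every distinct reference answer (hypothetical helper)."""
    indexed = list(enumerate(gt_answers))
    scores = {}
    for answer in set(gt_answers):
        accs = []
        for held_out in indexed:
            # Leave one annotator out, count how many of the rest agree.
            others = [item for item in indexed if item != held_out]
            matches = sum(1 for _, ref in others if ref == answer)
            accs.append(min(1.0, matches / 3))
        scores[answer] = sum(accs) / len(accs)
    return scores


def soft_accuracy(pred_list: List[dict]) -> float:
    """Mean soft score over all predictions; unseen answers score 0.0."""
    total = 0.0
    for entry in pred_list:
        per_answer = soft_scores(entry["gt_answers"])
        total += per_answer.get(entry["pred_answer"], 0.0)
    return total / len(pred_list)


references = ["cat"] * 8 + ["cats", "kitty"]
pred_list = [
    {"pred_answer": "cat", "gt_answers": references},
    {"pred_answer": "dog", "gt_answers": references},
]
print(soft_accuracy(pred_list))
```

Under this formulation an answer given by only one annotator still earns partial credit, while an answer no annotator gave scores zero.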

Code Reference

Source Location

Signature

class EvalAIAnswerProcessor:
    def __call__(self, item: str) -> str: ...
    def word_tokenize(self, word: str) -> str: ...
    def process_punctuation(self, in_text: str) -> str: ...
    def process_digit_article(self, in_text: str) -> str: ...

class TextVQAAccuracyEvaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

class STVQAAccuracyEvaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...

class STVQAANLSEvaluator:
    def get_anls(self, s1: str, s2: str) -> float: ...
    def eval_pred_list(self, pred_list: list) -> float: ...

class TextCapsBleu4Evaluator:
    def eval_pred_list(self, pred_list: list) -> float: ...
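To make the get_anls signature more concrete, here is a minimal standalone sketch of a thresholded normalized Levenshtein similarity. It uses a pure-Python edit distance so that it runs without extra dependencies; the module itself may rely on a library for this. The helper names levenshtein and anls_score, the threshold parameter, the lowercasing and stripping of inputs, and the handling of two empty strings are all assumptions made for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def get_anls(s1: str, s2: str, threshold: float = 0.5) -> float:
    """Normalized similarity in [0, 1]; values below the threshold become 0.0."""
    s1, s2 = s1.lower().strip(), s2.lower().strip()
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    similarity = 1 - levenshtein(s1, s2) / longest
    return similarity if similarity >= threshold else 0.0


def anls_score(pred_list: List[dict]) -> float:
    """Average, over predictions, of the best similarity to any reference."""
    best = [
        max(get_anls(entry["pred_answer"], gt) for gt in entry["gt_answers"])
        for entry in pred_list
    ]
    return sum(best) / len(best)


print(get_anls("kitten", "sitting"))
print(get_anls("cat", "dog"))
```

The threshold is what separates ANLS from a plain similarity average: a near miss in text recognition keeps partial credit, while an unrelated answer contributes nothing.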

Import

from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator, EvalAIAnswerProcessor

I/O Contract

Inputs

Name | Type | Required | Description
pred_list | List[dict] | Yes | List of prediction dicts with "pred_answer" (str) and "gt_answers" (List[str]) keys
item | str | Yes | Raw answer string to normalize (for EvalAIAnswerProcessor)

Outputs

Name | Type | Description
accuracy | float | Aggregate accuracy/ANLS/BLEU-4 score from eval_pred_list
normalized_answer | str | Normalized answer string from EvalAIAnswerProcessor
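The contract above can be written down as a typed structure and checked before scoring. The sketch below is only an illustration: Prediction, check_pred_list, and exact_match_accuracy are hypothetical names, and the default normalize function (lowercase and strip) is a stand-in assumption for the full answer normalization. The exact-match rule mirrors the STVQA accuracy description, where a prediction scores 1.0 if it equals any reference after normalization.

```python
from typing import Callable, List, TypedDict


class Prediction(TypedDict):
    """Hypothetical name for one entry of pred_list."""
    pred_answer: str
    gt_answers: List[str]


def check_pred_list(pred_list: List[Prediction]) -> None:
    """Raise TypeError if an entry does not follow the documented contract."""
    for index, entry in enumerate(pred_list):
        if not isinstance(entry.get("pred_answer"), str):
            raise TypeError(f"entry {index}: 'pred_answer' must be a str")
        references = entry.get("gt_answers")
        if not isinstance(references, list) or not all(
            isinstance(ref, str) for ref in references
        ):
            raise TypeError(f"entry {index}: 'gt_answers' must be a list of str")


def exact_match_accuracy(
    pred_list: List[Prediction],
    normalize: Callable[[str], str] = lambda s: s.lower().strip(),
) -> float:
    """Fraction of predictions equal to at least one normalized reference."""
    check_pred_list(pred_list)
    hits = 0.0
    for entry in pred_list:
        prediction = normalize(entry["pred_answer"])
        references = {normalize(ref) for ref in entry["gt_answers"]}
        hits += 1.0 if prediction in references else 0.0
    return hits / len(pred_list)


preds: List[Prediction] = [
    {"pred_answer": " Stop ", "gt_answers": ["stop", "halt"]},
    {"pred_answer": "exit", "gt_answers": ["entrance"]},
]
print(exact_match_accuracy(preds))
```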

Usage Examples

Basic Usage

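from llava.eval.m4c_evaluator import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()

# One dict per question: the model's answer plus the 10 human reference answers.
pred_list = [
    {
        "pred_answer": "a cat",
        "gt_answers": ["cat", "a cat", "cat", "cats", "cat",
                       "a cat", "cat", "cat", "kitty", "cat"]
    }
]

# Answers are normalized before scoring; the result is the mean soft accuracy over pred_list.
accuracy = evaluator.eval_pred_list(pred_list)
print(f"TextVQA Accuracy: {accuracy:.4f}")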
