Implementation:OpenGVLab InternVL POPE Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmark, Hallucination_Detection |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This script evaluates model predictions on the POPE (Polling-based Object Probing Evaluation) benchmark by computing accuracy, precision, recall, F1 score, and yes-ratio metrics for object hallucination detection.
Description
The eval_pope.py script implements the evaluation logic for the POPE benchmark, which measures object hallucination in vision-language models. The evaluation works as follows:
- Answer normalization: Model free-text responses are reduced to binary yes/no predictions: a response is labeled "no" if its first sentence contains one of the keywords "No", "not", or "no", and "yes" otherwise
- Label loading: Ground truth labels ("yes"/"no") are loaded from annotation files in the specified directory
- Per-category evaluation: The script iterates over annotation files matching the pattern coco_pope_*.json, filtering predictions by category (e.g., "random", "popular", "adversarial")
- Metric computation: For each category, it computes TP, FP, TN, FN counts and derives accuracy, precision, recall, F1 score, and yes-ratio
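The answer-normalization step can be sketched as follows. This is a minimal illustration rather than the script's exact code: the helper name `normalize_answer` is hypothetical, and the comma stripping and whitespace tokenization are assumptions about how the keyword check is applied.

```python
def normalize_answer(text: str) -> str:
    """Map a free-text response to 'yes' or 'no' (hypothetical helper)."""
    # Keep only the first sentence, i.e. everything before the first period.
    first_sentence = text.split(".")[0]
    # Assumption: commas are dropped so that "No, ..." yields the token "No".
    words = first_sentence.replace(",", "").split(" ")
    # Any negative keyword marks the answer as "no"; everything else is "yes".
    if "No" in words or "not" in words or "no" in words:
        return "no"
    return "yes"


if __name__ == "__main__":
    print(normalize_answer("No, there is no dog in the image."))
    print(normalize_answer("Yes, there is a dog. It is not small."))
```

Because only the first sentence is inspected, a negation that appears later in the response does not change the prediction.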
The yes-ratio (the fraction of predictions that are "yes") is particularly important for POPE because it reveals a model's bias toward answering "yes" regardless of the question, a common hallucination pattern.
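The metric computation can be sketched as below, assuming that "yes" is treated as the positive class and that predictions and labels are already normalized to "yes"/"no". The function name `pope_metrics` and the zero-division guards are illustrative additions, not taken from eval_pope.py.

```python
def pope_metrics(preds: list, labels: list) -> dict:
    """Compute POPE-style metrics from 'yes'/'no' strings (hypothetical helper)."""
    pairs = list(zip(preds, labels))
    # Assumption: "yes" is the positive class.
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    total = len(pairs)

    # Guards against empty denominators are an addition for robustness.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Fraction of predictions that are "yes", regardless of correctness.
    yes_ratio = (tp + fp) / total if total else 0.0

    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1, "yes_ratio": yes_ratio,
    }


if __name__ == "__main__":
    preds = ["yes", "yes", "no", "no", "yes"]
    labels = ["yes", "no", "no", "yes", "yes"]
    print(pope_metrics(preds, labels))
```

A yes-ratio far above the share of "yes" labels in the annotations suggests the model affirms objects that are not present.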
Usage
Use this script to evaluate LLaVA model outputs on the POPE benchmark after generating predictions with a VQA inference script. It provides per-category hallucination metrics across random, popular, and adversarial sampling strategies.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/eval_pope.py
- Lines: 1-81
Signature
def eval_pope(answers: list, label_file: str) -> None: ...
Import
# This is a standalone CLI script, not typically imported
# Run via: python eval_pope.py --annotation-dir /path/to/pope --question-file questions.jsonl --result-file results.jsonl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --annotation-dir | str (dir path) | Yes | Directory containing POPE annotation files (coco_pope_*.json) |
| --question-file | str (file path) | Yes | Path to JSONL file with questions (containing question_id and category) |
| --result-file | str (file path) | Yes | Path to JSONL file with model predictions (containing question_id and text) |
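How the three inputs fit together can be sketched as follows. The helper names are hypothetical, and deriving the category from the part of the annotation file name between `coco_pope_` and `.json` is an assumption based on the file pattern above; only the `question_id`, `category`, and `text` fields come from the table.

```python
import json


def load_jsonl(path: str) -> list:
    """Read one JSON object per non-empty line (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def category_from_filename(name: str) -> str:
    """Assumption: 'coco_pope_random.json' corresponds to category 'random'."""
    prefix, suffix = "coco_pope_", ".json"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected annotation file name: {name}")
    return name[len(prefix):-len(suffix)]


def answers_for_category(questions: list, answers: list, category: str) -> list:
    """Keep the predictions whose question belongs to the given category."""
    # Index the question records by question_id for the join.
    category_by_id = {q["question_id"]: q["category"] for q in questions}
    return [a for a in answers if category_by_id[a["question_id"]] == category]


if __name__ == "__main__":
    # Tiny in-memory records standing in for the two JSONL files.
    questions = [
        {"question_id": 1, "category": "random"},
        {"question_id": 2, "category": "popular"},
    ]
    answers = [
        {"question_id": 1, "text": "Yes, there is a dog."},
        {"question_id": 2, "text": "No, there is no car."},
    ]
    category = category_from_filename("coco_pope_random.json")
    print(category, answers_for_category(questions, answers, category))
```

Joining on `question_id` keeps the result file free of category information, so the same predictions can be scored against each annotation file in turn.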
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | text | Per-category TP/FP/TN/FN counts, accuracy, precision, recall, F1, and yes-ratio |
Usage Examples
Basic Usage
# Command-line execution for POPE evaluation
# python internvl_chat_llava/llava/eval/eval_pope.py \
# --annotation-dir /path/to/coco_pope_annotations \
# --question-file pope_questions.jsonl \
# --result-file model_predictions.jsonl