

Implementation:OpenGVLab InternVL POPE Benchmark Evaluation

From Leeroopedia


Knowledge Sources
Domains: Evaluation, Benchmark, Hallucination_Detection
Last Updated: 2026-02-07 14:00 GMT

Overview

This script evaluates model predictions on the POPE (Polling-based Object Probing Evaluation) benchmark by computing accuracy, precision, recall, F1 score, and yes-ratio metrics for object hallucination detection.

Description

The eval_pope.py script implements the evaluation logic for the POPE benchmark, which measures object hallucination in vision-language models. The evaluation works as follows:

  1. Answer normalization: Each free-text model response is reduced to a binary prediction: it becomes "no" if the first sentence contains any of the keywords "No", "not", or "no", and "yes" otherwise
  2. Label loading: Ground truth labels ("yes"/"no") are loaded from annotation files in the specified directory
  3. Per-category evaluation: The script iterates over annotation files matching the pattern coco_pope_*.json, filtering predictions by category (e.g., "random", "popular", "adversarial")
  4. Metric computation: For each category, it computes TP, FP, TN, FN counts and derives accuracy, precision, recall, F1 score, and yes-ratio
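The normalization and counting steps above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names normalize_answer and confusion_counts are hypothetical, and three details are assumptions made here: the "first sentence" is taken to be everything before the first period, commas are stripped before matching, and "yes" is treated as the positive class.

```python
def normalize_answer(text: str) -> str:
    """Map a free-text response to 'yes' or 'no' (hypothetical helper)."""
    # Assumption: the first sentence is everything before the first period.
    first_sentence = text.split(".")[0]
    # Assumption: commas are stripped so that "No," still matches "No".
    words = first_sentence.replace(",", "").split()
    negative = {"No", "not", "no"}  # keywords named in the description
    return "no" if negative.intersection(words) else "yes"


def confusion_counts(preds, labels):
    """Count TP, FP, TN, FN, treating 'yes' as the positive class (assumption)."""
    tp = fp = tn = fn = 0
    for pred, label in zip(preds, labels):
        if pred == "yes" and label == "yes":
            tp += 1
        elif pred == "yes" and label == "no":
            fp += 1
        elif pred == "no" and label == "no":
            tn += 1
        else:  # pred == "no" and label == "yes"
            fn += 1
    return tp, fp, tn, fn


# Made-up responses for illustration only.
responses = [
    "Yes, there is a dog in the image.",
    "No, there is no cat.",
    "There is not a bicycle in the image.",
    "The image shows a car. It is not red.",
]
preds = [normalize_answer(r) for r in responses]
labels = ["yes", "no", "yes", "no"]
print(preds)                            # ['yes', 'no', 'no', 'yes']
print(confusion_counts(preds, labels))  # (1, 1, 1, 1)
```

The last response is classified as "yes" because the keyword "not" appears only in its second sentence, which the first-sentence rule ignores.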

The yes-ratio metric is particularly important for POPE as it reveals model bias toward answering "yes" regardless of the question, a common hallucination pattern.
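To make the yes-bias point concrete, the sketch below derives the five metrics from confusion counts using their standard definitions and applies them to a hypothetical model that answers "yes" to every question on a balanced set of 100 questions. The function name pope_metrics, the example counts, and the zero-division guards are illustrative assumptions, not taken from the script.

```python
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive metrics from confusion counts ('yes' is the positive class)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    # Guards against empty denominators are an addition for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Fraction of all predictions that were "yes".
        "yes_ratio": (tp + fp) / total,
    }


# Hypothetical model that always answers "yes": 50 true "yes", 50 true "no".
always_yes = pope_metrics(tp=50, fp=50, tn=0, fn=0)
for name, value in always_yes.items():
    print(f"{name}: {value:.3f}")
```

In this example recall is 1.0 and F1 is about 0.667, which could look acceptable in isolation, while accuracy is 0.5 and the yes-ratio of 1.0 makes the bias obvious.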

Usage

Use this script to evaluate LLaVA model outputs on the POPE benchmark after generating predictions with a VQA inference script. It provides per-category hallucination metrics across random, popular, and adversarial sampling strategies.

Code Reference

Source Location

Signature

def eval_pope(answers: list, label_file: str) -> None: ...

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_pope.py --annotation-dir /path/to/pope --question-file questions.jsonl --result-file results.jsonl

I/O Contract

Inputs

Name | Type | Required | Description
--annotation-dir | str (dir path) | Yes | Directory containing POPE annotation files (coco_pope_*.json)
--question-file | str (file path) | Yes | Path to JSONL file with questions (containing question_id and category)
--result-file | str (file path) | Yes | Path to JSONL file with model predictions (containing question_id and text)
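A rough sketch of how these inputs might fit together is shown below. The field names question_id, category, and text come from the table above; the record values, the helpers parse_jsonl, category_from_filename, and answers_for_category, and the idea of reading the category from the annotation file name are assumptions made for illustration, not the script's confirmed behavior.

```python
import json

# Illustrative JSONL contents (one JSON object per line); values are made up.
question_lines = [
    '{"question_id": 0, "category": "random"}',
    '{"question_id": 1, "category": "adversarial"}',
    '{"question_id": 2, "category": "random"}',
]
result_lines = [
    '{"question_id": 0, "text": "Yes, there is a dog."}',
    '{"question_id": 1, "text": "No, there is no cat."}',
    '{"question_id": 2, "text": "No."}',
]


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def category_from_filename(filename: str) -> str:
    """Extract e.g. 'random' from 'coco_pope_random.json' (assumed naming)."""
    return filename[len("coco_pope_"):-len(".json")]


def answers_for_category(questions, answers, category):
    """Keep only the answers whose question belongs to the given category."""
    by_id = {q["question_id"]: q for q in questions}
    return [a for a in answers
            if by_id[a["question_id"]]["category"] == category]


questions = parse_jsonl(question_lines)
answers = parse_jsonl(result_lines)
category = category_from_filename("coco_pope_random.json")
selected = answers_for_category(questions, answers, category)
print(category, [a["question_id"] for a in selected])  # random [0, 2]
```

Joining on question_id rather than relying on line order keeps the filter correct even if the question and result files are ordered differently.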

Outputs

Name | Type | Description
stdout | text | Per-category TP/FP/TN/FN counts, accuracy, precision, recall, F1, and yes-ratio

Usage Examples

Basic Usage

# Command-line execution for POPE evaluation
# python internvl_chat_llava/llava/eval/eval_pope.py \
#     --annotation-dir /path/to/coco_pope_annotations \
#     --question-file pope_questions.jsonl \
#     --result-file model_predictions.jsonl
