Implementation:FlagOpen FlagEmbedding MLVU Evaluate SSC

From Leeroopedia


Knowledge Sources
Domains Video Understanding, Evaluation, Natural Language Processing
Last Updated 2026-02-09 00:00 GMT

Overview

A GPT-4-based evaluation script for the sub-scene captioning (SSC) task in the MLVU video understanding benchmark.

Description

This script evaluates model predictions for the sub-scene captioning task using GPT-4 as an automated evaluator. It compares predicted answers against scoring points and ground truth answers, providing accuracy and relevance scores on a 1-5 scale. The evaluation is performed by sending prompts to the GPT-4 API with detailed scoring criteria, then aggregating results across all predictions.

The script handles large-scale evaluation by supporting parallel processing, retry logic for API failures, and progress tracking. It processes prediction files, evaluates each question-answer pair, and combines results into a final JSON output.
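The retry behavior described above can be sketched as a generic backoff wrapper. This is a minimal illustration only; `with_retries` is a hypothetical helper name, not the script's actual function, and the real script's retry policy may differ:

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, ... before the next attempt.
            time.sleep(base_delay * 2 ** attempt)
```

In the script's setting, `fn` would be a closure around the GPT-4 API call, so transient rate-limit or network errors do not abort a long evaluation run.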

Usage

Use this script to automatically evaluate video understanding model outputs on MLVU's sub-scene captioning task, particularly when manual evaluation is impractical due to scale. It requires an OpenAI API key and prediction files in the expected JSON format.

Code Reference

Source Location

Signature

def annotate(prediction_set, caption_files, output_dir):
    """
    Evaluates question and answer pairs using GPT-4
    Returns a score for correctness.
    """

def main():
    """
    Main function to control the flow of the program.
    """

Import

import openai
import argparse
import json
from multiprocessing.pool import Pool
from tqdm import tqdm

I/O Contract

Inputs

Name Type Required Description
pred_path str Yes Path to prediction JSON file containing model outputs
output_dir str Yes Directory to save individual evaluation results
output_json str Yes Path to save final combined results JSON file
api_key str Yes OpenAI API key for GPT-4 access
num_tasks int No Number of parallel task splits (default: 1)

Outputs

Name Type Description
Individual JSON files JSON Per-prediction evaluation with explanation and scores
Combined JSON JSON Aggregated results from all evaluations with scoring details
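The aggregation step can be sketched as follows, assuming each per-prediction result holds a numeric `score` on the 1-5 scale described above (`combine_results` is a hypothetical name; the actual script's output fields may differ):

```python
def combine_results(evaluations):
    """Aggregate per-prediction evaluation dicts into a summary:
    the number of evaluated pairs and their mean 1-5 score."""
    scores = [e["score"] for e in evaluations]
    return {
        "num_evaluated": len(scores),
        "average_score": round(sum(scores) / len(scores), 3),
    }
```

In the script, the inputs would be read from the individual JSON files in `output_dir`, and the summary written to `output_json`.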

Usage Examples

# Command line usage
python evaluate_ssc.py \
    --pred_path output_dir/qwen/pred_subPlot_all.json \
    --output_dir output_dir/qwen_subplot_all \
    --output_json output_dir/qwen_subplot_all_results.json \
    --api_key YOUR_API_KEY \
    --num_tasks 4

# The prediction file should have this format:
[
    {
        "video_name": "video1.mp4",
        "Q": "Describe the scene",
        "A": "Ground truth answer",
        "pred": "Model prediction"
    }
]
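The `--num_tasks` flag controls how the prediction list is partitioned across worker processes. A minimal sketch of such a split (a hypothetical `split_tasks` helper; the script's actual chunking may differ):

```python
import math

def split_tasks(items, num_tasks):
    """Split the predictions into num_tasks near-equal chunks,
    one chunk per worker process in the multiprocessing Pool."""
    chunk = math.ceil(len(items) / num_tasks)
    return [items[i:i + chunk] for i in range(0, len(items), chunk)]
```

Each chunk would then be handed to `annotate` via `multiprocessing.pool.Pool`, with `tqdm` tracking completed chunks.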
