Implementation:FlagOpen FlagEmbedding MLVU Evaluate SSC
| Knowledge Sources | Details |
|---|---|
| Domains | Video Understanding, Evaluation, Natural Language Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A GPT-4-based evaluation script for the sub-scene captioning (SSC) task in the MLVU video understanding benchmark.
Description
This script evaluates model predictions for the sub-scene captioning task using GPT-4 as an automated evaluator. For each prediction, it compares the predicted answer against the scoring points and the ground-truth answer, producing accuracy and relevance scores on a 1-5 scale. Evaluation is performed by sending a prompt with detailed scoring criteria to the GPT-4 API for each question-answer pair, then aggregating the results across all predictions.
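The per-pair API request can be sketched as below. The exact wording of the scoring criteria lives in the script itself; the strings here are illustrative placeholders, and `build_messages` is a hypothetical helper name, not the script's own.

```python
def build_messages(question, answer, pred, scoring_points):
    """Assemble the chat messages for evaluating one QA pair.

    The system/user wording below is an illustrative stand-in for the
    script's actual scoring prompt.
    """
    system = (
        "You are an evaluator for video sub-scene captioning. "
        "Score the prediction for accuracy and relevance on a 1-5 scale."
    )
    user = (
        f"Question: {question}\n"
        f"Scoring points: {scoring_points}\n"
        f"Ground truth answer: {answer}\n"
        f"Predicted answer: {pred}\n"
        "Return a JSON object with the scores and a short explanation."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```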
To handle large-scale evaluation, the script supports parallel processing, retry logic for transient API failures, and progress tracking. It reads a prediction file, evaluates each question-answer pair, and combines the per-prediction results into a final JSON output.
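The parallel fan-out amounts to splitting the remaining files into roughly equal chunks, one per worker, and dispatching them with a process pool. This is a minimal sketch of that splitting step; `split_tasks` is a hypothetical name, and the chunking arithmetic is an assumption rather than the script's exact code.

```python
def split_tasks(caption_files, num_tasks):
    """Divide the list of files into roughly equal chunks, one per worker.

    Each chunk would then be passed to `annotate` via
    multiprocessing.pool.Pool.starmap in the evaluation loop.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be >= 1")
    # Integer division can leave a remainder, which spills into extra chunks.
    part_len = max(1, len(caption_files) // num_tasks)
    return [caption_files[i:i + part_len]
            for i in range(0, len(caption_files), part_len)]
```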
Usage
Use this script to automatically evaluate video understanding model outputs on MLVU's sub-scene captioning task, particularly when manual evaluation is impractical due to scale. It requires an OpenAI API key and prediction files in the expected JSON format.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/generation_evaluation/evaluate_ssc.py
- Lines: 1-237
Signature
def annotate(prediction_set, caption_files, output_dir):
"""
Evaluates question and answer pairs using GPT-4
Returns a score for correctness.
"""
def main():
"""
Main function to control the flow of the program.
"""
Import
import openai
import argparse
import json
from multiprocessing.pool import Pool
from tqdm import tqdm
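The retry behaviour around API calls can be approximated with a small wrapper like the one below. The backoff schedule and retry count are assumptions for illustration; the script's own retry loop may differ.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky callable (e.g. an OpenAI API request) with linear backoff.

    Re-raises the last exception once max_retries attempts are exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(base_delay * (attempt + 1))
```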
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pred_path | str | Yes | Path to prediction JSON file containing model outputs |
| output_dir | str | Yes | Directory to save individual evaluation results |
| output_json | str | Yes | Path to save final combined results JSON file |
| api_key | str | Yes | OpenAI API key for GPT-4 access |
| num_tasks | int | No | Number of parallel task splits (default: 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| Individual JSON files | JSON | Per-prediction evaluation with explanation and scores |
| Combined JSON | JSON | Aggregated results from all evaluations with scoring details |
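The aggregation step merges the per-prediction files in `output_dir` into the combined results file. The sketch below assumes each per-prediction file is a JSON object containing a numeric `score` field; the actual field names and layout follow the script, so treat this as an illustration of the pattern.

```python
import json
import os

def combine_results(output_dir, output_json):
    """Merge per-prediction JSON files into one combined results file.

    Assumes each file in output_dir holds one evaluation result with a
    numeric "score" key; returns the average score across all results.
    """
    combined = {}
    for name in sorted(os.listdir(output_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(output_dir, name)) as f:
            combined[name[:-5]] = json.load(f)  # strip ".json" from the key
    scores = [v["score"] for v in combined.values() if "score" in v]
    avg = sum(scores) / len(scores) if scores else 0.0
    with open(output_json, "w") as f:
        json.dump({"results": combined, "average_score": avg}, f, indent=2)
    return avg
```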
Usage Examples
# Command line usage
python evaluate_ssc.py \
--pred_path output_dir/qwen/pred_subPlot_all.json \
--output_dir output_dir/qwen_subplot_all \
--output_json output_dir/qwen_subplot_all_results.json \
--api_key YOUR_API_KEY \
--num_tasks 4
# The prediction file should have this format:
[
{
"video_name": "video1.mp4",
"Q": "Describe the scene",
"A": "Ground truth answer",
"pred": "Model prediction"
}
]
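Given the format above, a loader can validate predictions up front rather than failing mid-evaluation. `load_predictions` is a hypothetical helper (not part of the script); the required keys come directly from the format shown above.

```python
import json

REQUIRED_KEYS = {"video_name", "Q", "A", "pred"}

def load_predictions(pred_path):
    """Load the prediction file and check every entry has the expected keys."""
    with open(pred_path) as f:
        preds = json.load(f)
    if not isinstance(preds, list):
        raise ValueError("prediction file must contain a JSON list")
    for i, item in enumerate(preds):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    return preds
```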