Implementation:FlagOpen FlagEmbedding MLVU Evaluate SSC
| Knowledge Sources | Details |
|---|---|
| Domains | Video Understanding, Evaluation, Natural Language Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A GPT-4-based evaluation script for the sub-scene captioning (SSC) task in the MLVU video understanding benchmark.
Description
This script evaluates model predictions for the sub-scene captioning task using GPT-4 as an automated evaluator. For each prediction, it compares the predicted answer against the scoring points and the ground-truth answer, producing accuracy and relevance scores on a 1-5 scale. Evaluation is performed by sending a prompt with detailed scoring criteria to the GPT-4 API for each question-answer pair, then aggregating the results across all predictions.
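The per-pair API request can be sketched as below. The exact wording of the scoring criteria lives in the script itself; the strings here are illustrative placeholders, and `build_messages` is a hypothetical helper name, not the script's own.

```python
def build_messages(question, answer, pred, scoring_points):
    """Assemble the chat messages for evaluating one QA pair.

    The system/user wording below is an illustrative stand-in for the
    script's actual scoring prompt.
    """
    system = (
        "You are an evaluator for video sub-scene captioning. "
        "Score the prediction for accuracy and relevance on a 1-5 scale."
    )
    user = (
        f"Question: {question}\n"
        f"Scoring points: {scoring_points}\n"
        f"Ground truth answer: {answer}\n"
        f"Predicted answer: {pred}\n"
        "Return a JSON object with the scores and a short explanation."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```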
To handle large-scale evaluation, the script supports parallel processing, retry logic for transient API failures, and progress tracking. It reads a prediction file, evaluates each question-answer pair, and combines the per-prediction results into a final JSON output.
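The parallel fan-out amounts to splitting the remaining files into roughly equal chunks, one per worker, and dispatching them with a process pool. This is a minimal sketch of that splitting step; `split_tasks` is a hypothetical name, and the chunking arithmetic is an assumption rather than the script's exact code.

```python
def split_tasks(caption_files, num_tasks):
    """Divide the list of files into roughly equal chunks, one per worker.

    Each chunk would then be passed to `annotate` via
    multiprocessing.pool.Pool.starmap in the evaluation loop.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be >= 1")
    # Integer division can leave a remainder, which spills into extra chunks.
    part_len = max(1, len(caption_files) // num_tasks)
    return [caption_files[i:i + part_len]
            for i in range(0, len(caption_files), part_len)]
```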
Usage
Use this script to automatically evaluate video understanding model outputs on MLVU's sub-scene captioning task, particularly when manual evaluation is impractical due to scale. It requires an OpenAI API key and prediction files in the expected JSON format.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/generation_evaluation/evaluate_ssc.py
- Lines: 1-237
Signature
def annotate(prediction_set, caption_files, output_dir):
"""
Evaluates question and answer pairs using GPT-4
Returns a score for correctness.
"""
def main():
"""
Main function to control the flow of the program.
"""
Import
import openai
import argparse
import json
from multiprocessing.pool import Pool
from tqdm import tqdm
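The retry behaviour around API calls can be approximated with a small wrapper like the one below. The backoff schedule and retry count are assumptions for illustration; the script's own retry loop may differ.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky callable (e.g. an OpenAI API request) with linear backoff.

    Re-raises the last exception once max_retries attempts are exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(base_delay * (attempt + 1))
```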
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pred_path | str | Yes | Path to prediction JSON file containing model outputs |
| output_dir | str | Yes | Directory to save individual evaluation results |
| output_json | str | Yes | Path to save final combined results JSON file |
| api_key | str | Yes | OpenAI API key for GPT-4 access |
| num_tasks | int | No | Number of parallel task splits (default: 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| Individual JSON files | JSON | Per-prediction evaluation with explanation and scores |
| Combined JSON | JSON | Aggregated results from all evaluations with scoring details |
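The aggregation step merges the per-prediction files in `output_dir` into the combined results file. The sketch below assumes each per-prediction file is a JSON object containing a numeric `score` field; the actual field names and layout follow the script, so treat this as an illustration of the pattern.

```python
import json
import os

def combine_results(output_dir, output_json):
    """Merge per-prediction JSON files into one combined results file.

    Assumes each file in output_dir holds one evaluation result with a
    numeric "score" key; returns the average score across all results.
    """
    combined = {}
    for name in sorted(os.listdir(output_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(output_dir, name)) as f:
            combined[name[:-5]] = json.load(f)  # strip ".json" from the key
    scores = [v["score"] for v in combined.values() if "score" in v]
    avg = sum(scores) / len(scores) if scores else 0.0
    with open(output_json, "w") as f:
        json.dump({"results": combined, "average_score": avg}, f, indent=2)
    return avg
```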
Usage Examples
# Command line usage
python evaluate_ssc.py \
--pred_path output_dir/qwen/pred_subPlot_all.json \
--output_dir output_dir/qwen_subplot_all \
--output_json output_dir/qwen_subplot_all_results.json \
--api_key YOUR_API_KEY \
--num_tasks 4
# The prediction file should have this format:
[
{
"video_name": "video1.mp4",
"Q": "Describe the scene",
"A": "Ground truth answer",
"pred": "Model prediction"
}
]
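Given the format above, a loader can validate predictions up front rather than failing mid-evaluation. `load_predictions` is a hypothetical helper (not part of the script); the required keys come directly from the format shown above.

```python
import json

REQUIRED_KEYS = {"video_name", "Q", "A", "pred"}

def load_predictions(pred_path):
    """Load the prediction file and check every entry has the expected keys."""
    with open(pred_path) as f:
        preds = json.load(f)
    if not isinstance(preds, list):
        raise ValueError("prediction file must contain a JSON list")
    for i, item in enumerate(preds):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    return preds
```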