Implementation: FlagOpen FlagEmbedding MLVU Evaluate Summary
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Evaluation, Natural Language Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
GPT-4 based evaluation script for video summarization tasks in MLVU benchmark.
Description
This script evaluates video summarization model predictions using GPT-4 as an automated judge. Each summary is scored on two dimensions: completeness (how well it covers the key points of the reference, scored 1-5) and reliability (factual accuracy and clarity, scored 1-5). The evaluation compares each predicted summary against its standard answer using detailed scoring rubrics.
The script processes prediction files through the OpenAI API with robust error handling, retry logic, and progress tracking. It handles batch processing with multiprocessing support and aggregates individual evaluations into a combined results file.
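The retry behavior described above can be sketched as a small generic helper. This is a minimal illustration of the pattern, not the script's actual code: the function name `call_with_retries` and its parameters are assumptions, and in the real script the retried callable would wrap the GPT-4 request.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on failure.

    Generic sketch of the retry pattern around an API call; in the real
    script, fn would wrap the OpenAI request for one prediction.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # the script retries on transient API errors
            last_error = exc
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error
```

The injectable `sleep` argument keeps the helper testable without real delays.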
Usage
Use this script to automatically evaluate video summarization outputs from multimodal language models on the MLVU benchmark when manual scoring is impractical. Requires OpenAI API access and properly formatted prediction files.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/generation_evaluation/evaluate_summary.py
- Lines: 1-206
Signature
def annotate(prediction_set, caption_files, output_dir):
"""
Evaluates question and answer pairs using GPT-4
"""
def main():
"""
Main function to control the flow of the program.
"""
Import
import openai
import argparse
import json
from multiprocessing.pool import Pool
from tqdm import tqdm
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pred_path | str | Yes | Path to prediction JSON file with model summaries |
| output_dir | str | Yes | Directory for saving individual evaluation JSON files |
| output_json | str | Yes | Path to save aggregated results |
| api_key | str | Yes | OpenAI API key for GPT-4 |
| num_tasks | int | No | Number of parallel processing splits (default: 1) |
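The `num_tasks` input drives the multiprocessing split: the remaining files are partitioned into roughly equal chunks, one per worker. The helper below is an illustrative sketch of that partitioning; `split_tasks` is not the script's actual function, and the real chunking may differ in detail.

```python
def split_tasks(files, num_tasks):
    """Split a list of caption files into num_tasks near-equal chunks.

    Illustrative sketch of how --num_tasks could drive the Pool split;
    the real script's partitioning logic may differ.
    """
    num_tasks = max(1, min(num_tasks, len(files) or 1))
    size = len(files) // num_tasks
    chunks = [files[i * size:(i + 1) * size] for i in range(num_tasks)]
    # Fold any remainder into the last chunk
    chunks[-1].extend(files[num_tasks * size:])
    return [c for c in chunks if c]
```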
Outputs
| Name | Type | Description |
|---|---|---|
| Individual JSON files | JSON | Per-summary evaluation with completeness and reliability scores |
| Combined JSON | JSON | Aggregated evaluation results with explanations |
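Once the per-summary files are combined, average scores are the natural summary statistic. The sketch below assumes each evaluation dict carries integer `completeness` and `reliability` fields; those field names mirror the rubric but are an assumption about the script's output schema.

```python
def aggregate_scores(evaluations):
    """Compute mean completeness and reliability over evaluation dicts.

    Sketch of a final aggregation step; assumes each entry holds
    'completeness' and 'reliability' scores, which may not match the
    script's exact field names.
    """
    n = len(evaluations)
    if n == 0:
        return {"completeness": 0.0, "reliability": 0.0}
    return {
        "completeness": sum(e["completeness"] for e in evaluations) / n,
        "reliability": sum(e["reliability"] for e in evaluations) / n,
    }
```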
Usage Examples
# Command line execution
python evaluate_summary.py \
--pred_path output_dir/qwen/pred_summary_all.json \
--output_dir output_dir/qwen_summary_all \
--output_json output_dir/qwen_summary_all_results.json \
--api_key YOUR_OPENAI_API_KEY \
--num_tasks 4
# Expected prediction format:
[
{
"video_name": "video1.mp4",
"Q": "Summarize the video",
"A": "Ground truth summary",
"pred": "Model generated summary"
}
]
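A quick pre-flight check against the prediction format above can save a wasted API run. The validator below is hypothetical (the script does not ship one); it only checks for the four keys shown in the example.

```python
# Keys taken from the expected prediction format shown above
REQUIRED_KEYS = {"video_name", "Q", "A", "pred"}

def validate_predictions(entries):
    """Return indices of entries missing any required key.

    Hypothetical pre-flight check for the prediction format; the
    evaluate_summary.py script itself does not include such a validator.
    """
    bad = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict) or not REQUIRED_KEYS <= entry.keys():
            bad.append(i)
    return bad
```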