Implementation: FlagOpen FlagEmbedding MLVU Evaluate Summary
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Evaluation, Natural Language Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
GPT-4 based evaluation script for video summarization tasks in MLVU benchmark.
Description
This script evaluates video summarization model predictions using GPT-4 as an automated judge. Each summary is scored on two dimensions: completeness (how well it covers the key points of the reference, scored 1-5) and reliability (factual accuracy and clarity, scored 1-5). The evaluation compares each predicted summary against its standard answer using detailed scoring rubrics.
The script processes prediction files through the OpenAI API with robust error handling, retry logic, and progress tracking. It handles batch processing with multiprocessing support and aggregates individual evaluations into a combined results file.
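The retry behavior described above can be sketched as a small generic helper. This is a minimal illustration of the pattern, not the script's actual code: the function name `call_with_retries` and its parameters are assumptions, and in the real script the retried callable would wrap the GPT-4 request.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on failure.

    Generic sketch of the retry pattern around an API call; in the real
    script, fn would wrap the OpenAI request for one prediction.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # the script retries on transient API errors
            last_error = exc
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error
```

The injectable `sleep` argument keeps the helper testable without real delays.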
Usage
Use this script to automatically evaluate video summarization outputs from multimodal language models on the MLVU benchmark when manual scoring is impractical. Requires OpenAI API access and properly formatted prediction files.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/generation_evaluation/evaluate_summary.py
- Lines: 1-206
Signature
def annotate(prediction_set, caption_files, output_dir):
"""
Evaluates question and answer pairs using GPT-4
"""
def main():
"""
Main function to control the flow of the program.
"""
Import
import openai
import argparse
import json
from multiprocessing.pool import Pool
from tqdm import tqdm
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pred_path | str | Yes | Path to prediction JSON file with model summaries |
| output_dir | str | Yes | Directory for saving individual evaluation JSON files |
| output_json | str | Yes | Path to save aggregated results |
| api_key | str | Yes | OpenAI API key for GPT-4 |
| num_tasks | int | No | Number of parallel processing splits (default: 1) |
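The `num_tasks` input drives the multiprocessing split: the remaining files are partitioned into roughly equal chunks, one per worker. The helper below is an illustrative sketch of that partitioning; `split_tasks` is not the script's actual function, and the real chunking may differ in detail.

```python
def split_tasks(files, num_tasks):
    """Split a list of caption files into num_tasks near-equal chunks.

    Illustrative sketch of how --num_tasks could drive the Pool split;
    the real script's partitioning logic may differ.
    """
    num_tasks = max(1, min(num_tasks, len(files) or 1))
    size = len(files) // num_tasks
    chunks = [files[i * size:(i + 1) * size] for i in range(num_tasks)]
    # Fold any remainder into the last chunk
    chunks[-1].extend(files[num_tasks * size:])
    return [c for c in chunks if c]
```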
Outputs
| Name | Type | Description |
|---|---|---|
| Individual JSON files | JSON | Per-summary evaluation with completeness and reliability scores |
| Combined JSON | JSON | Aggregated evaluation results with explanations |
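Once the per-summary files are combined, average scores are the natural summary statistic. The sketch below assumes each evaluation dict carries integer `completeness` and `reliability` fields; those field names mirror the rubric but are an assumption about the script's output schema.

```python
def aggregate_scores(evaluations):
    """Compute mean completeness and reliability over evaluation dicts.

    Sketch of a final aggregation step; assumes each entry holds
    'completeness' and 'reliability' scores, which may not match the
    script's exact field names.
    """
    n = len(evaluations)
    if n == 0:
        return {"completeness": 0.0, "reliability": 0.0}
    return {
        "completeness": sum(e["completeness"] for e in evaluations) / n,
        "reliability": sum(e["reliability"] for e in evaluations) / n,
    }
```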
Usage Examples
# Command line execution
python evaluate_summary.py \
--pred_path output_dir/qwen/pred_summary_all.json \
--output_dir output_dir/qwen_summary_all \
--output_json output_dir/qwen_summary_all_results.json \
--api_key YOUR_OPENAI_API_KEY \
--num_tasks 4
# Expected prediction format:
[
{
"video_name": "video1.mp4",
"Q": "Summarize the video",
"A": "Ground truth summary",
"pred": "Model generated summary"
}
]
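A quick pre-flight check against the prediction format above can save a wasted API run. The validator below is hypothetical (the script does not ship one); it only checks for the four keys shown in the example.

```python
# Keys taken from the expected prediction format shown above
REQUIRED_KEYS = {"video_name", "Q", "A", "pred"}

def validate_predictions(entries):
    """Return indices of entries missing any required key.

    Hypothetical pre-flight check for the prediction format; the
    evaluate_summary.py script itself does not include such a validator.
    """
    bad = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict) or not REQUIRED_KEYS <= entry.keys():
            bad.append(i)
    return bad
```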