Implementation:FlagOpen FlagEmbedding MLVU Evaluate Summary

From Leeroopedia


Knowledge Sources
Domains Video Understanding, Evaluation, Natural Language Processing
Last Updated 2026-02-09 00:00 GMT

Overview

A GPT-4-based evaluation script for video summarization tasks in the MLVU benchmark.

Description

This script evaluates video summarization model predictions using GPT-4 as an automated judge. It scores each summary on two dimensions: completeness (how well the summary covers the key points of the reference, scored 1-5) and reliability (factual accuracy and clarity, scored 1-5). The evaluation compares predicted summaries against ground-truth answers using detailed scoring rubrics.

The script processes prediction files through the OpenAI API with robust error handling, retry logic, and progress tracking. It handles batch processing with multiprocessing support and aggregates individual evaluations into a combined results file.
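The retry behavior described above can be sketched as a small generic helper. This is an illustrative pattern only, with hypothetical names; the actual script wraps its OpenAI API calls in its own loop:

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure.

    Hypothetical helper showing the retry pattern the script uses
    around API calls; the real implementation may differ.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```

A transient failure (e.g. a rate-limit error) is simply retried with an increasing delay, so a temporary API hiccup does not lose an entire evaluation batch.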

Usage

Use this script to automatically evaluate video summarization outputs from multimodal language models on the MLVU benchmark when manual scoring is impractical. Requires OpenAI API access and properly formatted prediction files.

Code Reference

Source Location

Signature

def annotate(prediction_set, caption_files, output_dir):
    """
    Evaluates question and answer pairs using GPT-4
    """

def main():
    """
    Main function to control the flow of the program.
    """

Import

import openai
import argparse
import json
from multiprocessing.pool import Pool
from tqdm import tqdm

I/O Contract

Inputs

Name Type Required Description
pred_path str Yes Path to prediction JSON file with model summaries
output_dir str Yes Directory for saving individual evaluation JSON files
output_json str Yes Path to save aggregated results
api_key str Yes OpenAI API key for GPT-4
num_tasks int No Number of parallel processing splits (default: 1)

Outputs

Name Type Description
Individual JSON files JSON Per-summary evaluation with completeness and reliability scores
Combined JSON JSON Aggregated evaluation results with explanations
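The aggregation step that merges the per-summary files into the combined JSON can be sketched as follows. The field names `completeness` and `reliability` are assumptions about the per-file schema, and the function itself is illustrative rather than the script's actual code:

```python
import json
import os

def combine_results(output_dir, output_json):
    """Merge per-summary evaluation JSON files from output_dir into
    one combined JSON file and return the average scores.

    Hypothetical sketch of the aggregation step; the assumed score
    keys are 'completeness' and 'reliability'.
    """
    combined = {}
    for name in sorted(os.listdir(output_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(output_dir, name)) as f:
            combined[name[:-5]] = json.load(f)  # key = file stem
    with open(output_json, "w") as f:
        json.dump(combined, f, indent=2)
    n = max(len(combined), 1)
    avg_completeness = sum(v.get("completeness", 0) for v in combined.values()) / n
    avg_reliability = sum(v.get("reliability", 0) for v in combined.values()) / n
    return avg_completeness, avg_reliability
```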

Usage Examples

# Command line execution
python evaluate_summary.py \
    --pred_path output_dir/qwen/pred_summary_all.json \
    --output_dir output_dir/qwen_summary_all \
    --output_json output_dir/qwen_summary_all_results.json \
    --api_key YOUR_OPENAI_API_KEY \
    --num_tasks 4

# Expected prediction format:
[
    {
        "video_name": "video1.mp4",
        "Q": "Summarize the video",
        "A": "Ground truth summary",
        "pred": "Model generated summary"
    }
]
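Before submitting a batch, it is worth verifying that every entry matches the format above. A minimal validation sketch (illustrative only, not part of the actual script):

```python
# Required fields per the expected prediction format above.
REQUIRED_KEYS = ("video_name", "Q", "A", "pred")

def validate_predictions(entries):
    """Return the indices of entries missing any required field.

    Illustrative pre-flight check against the documented prediction
    format; not part of evaluate_summary.py itself.
    """
    bad = []
    for i, entry in enumerate(entries):
        if not all(k in entry for k in REQUIRED_KEYS):
            bad.append(i)
    return bad
```

Running this on a prediction file before launching a paid GPT-4 evaluation catches malformed entries early.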
