Implementation: FlagOpen FlagEmbedding VideoChat2 Open Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Open-ended Evaluation, MLVU Benchmark |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for the VideoChat2 model on open-ended video understanding tasks from the MLVU benchmark.
Description
This implementation provides an evaluation pipeline for the VideoChat2 model on MLVU benchmark's open-ended tasks, specifically for subplot and summary generation. Unlike the choice-based evaluation, this script evaluates the model's ability to generate free-form text responses to questions about video content. It uses the same VideoChat2 architecture with LoRA adaptations but focuses on two specific tasks: subplot (sub-scene) understanding and video summarization. The script processes long videos, extracts frames, and generates detailed textual responses using the VideoChat2 model with extended generation capabilities (up to 1000 tokens).
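The frame-extraction step described above typically samples a fixed number of frames uniformly across the whole video before passing them to the model. A minimal, hypothetical sketch of that kind of uniform temporal sampling (the function name and exact formula are illustrative, not taken from the source file):

```python
def get_frame_indices(num_frames: int, num_segments: int = 16) -> list:
    """Pick one frame index from the middle of each of num_segments
    equal-length spans of the video (uniform temporal sampling).
    Illustrative sketch; the source file's sampling code may differ."""
    seg_size = num_frames / num_segments
    # centre of each segment, clamped to valid frame indices
    return [min(num_frames - 1, int(seg_size * (i + 0.5)))
            for i in range(num_segments)]

# Example: a 100-frame video sampled into 8 segments
print(get_frame_indices(100, 8))  # → [6, 18, 31, 43, 56, 68, 81, 93]
```

The sampled indices would then be used to read frames (e.g. via decord's `VideoReader`) and resize them to the target resolution.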
Usage
Use this script to evaluate VideoChat2 models on MLVU benchmark tasks that require generating descriptive text responses about video content, particularly for subplot identification and video summarization tasks.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/models/videochat2/open_bench.py
- Lines: 1-473
Key Components
class MLVU(Dataset):
    def __init__(self, data_dir, data_list, num_segments=8, resolution=224):
        # Dataset initialization for open-ended tasks

    def qa_template(self, data):
        # Simple question-answer template without multiple choices
        question = f"{data['question']}"
        answer = data['answer']
        return question, answer

def infer_mvbench(data_sample, system="", question_prompt='',
                  answer_prompt=None, return_prompt='',
                  system_q=False, print_res=False, system_llm=False):
    # Inference with extended max_new_tokens for detailed responses
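The prompt that `infer_mvbench` feeds to the model is assembled from its `system`, `question_prompt`, and `answer_prompt` arguments around the dataset's question text. The exact template lives in the source file; the sketch below is a hypothetical illustration of that kind of concatenation (the separators are assumptions, not the actual template):

```python
def build_prompt(question, system="", question_prompt="", answer_prompt=None):
    """Assemble a single text prompt for open-ended generation.
    Hypothetical sketch: the separators and ordering are assumed,
    not copied from the source file's template."""
    prompt = system + question + question_prompt
    if answer_prompt:
        # an answer prefix such as "Answer:" can steer the model's output
        prompt += "\n" + answer_prompt
    return prompt

print(build_prompt(
    "What happens in the sub-scene?",
    system="Carefully watch this video and pay attention to every detail.\n",
))
```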
Import
# Model components
from models import VideoChat2_it_vicuna
from utils.config import Config
from utils.easydict import EasyDict
# Video processing
from decord import VideoReader, cpu
from dataset.video_transforms import (
    GroupNormalize, GroupScale, GroupCenterCrop,
    Stack, ToTorchFormatTensor
)
# Deep learning
import torch
from transformers import StoppingCriteria, StoppingCriteriaList
from peft import get_peft_model, LoraConfig, TaskType
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing JSON annotation files for subplot and summary tasks |
| data_list | dict | Yes | Mapping with subPlot and summary task configurations |
| num_segments | int | No | Number of frames to sample (default: 8 in the class signature; the usage example passes 16) |
| resolution | int | No | Target resolution for frames (default: 224) |
| model checkpoint | str | Yes | Path to videochat2_7b_stage3.pth checkpoint |
Outputs
| Name | Type | Description |
|---|---|---|
| subplot_all.json | JSON file | Contains subplot predictions with video names, questions, answers, and predictions |
| summary_all.json | JSON file | Contains summary predictions with video names, questions, answers, and predictions |
| Console output | text | Prediction outputs during inference |
Usage Examples
# Data configuration for open-ended tasks
data_list = {
    "subPlot": ("8_sub_scene.json", "/MLVU_all/video/subPlot", "video"),
    "summary": ("9_summary.json", "/MLVU_all/video/summary", "video")
}
# Initialize dataset
dataset = MLVU(data_dir="/MLVU_all/json", data_list=data_list,
               num_segments=16, resolution=224)
# Run inference with extended generation
for example in dataset:
    task_type = example['task_type']
    pred = infer_mvbench(
        example,
        system="Carefully watch this video and pay attention to every detail. Based on your observations, answer the given questions.\n",
        question_prompt="",
        answer_prompt="",
        max_new_tokens=1000  # Allow longer responses
    )
    # Store results by task type
    if task_type == "subPlot":
        result = {
            "video_name": example['video_path'].split("/")[-1],
            'Q': example['question'],
            'A': example['answer'],
            'pred': pred
        }