Implementation: FlagOpen FlagEmbedding VideoChat2 Open Bench
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Open-ended Evaluation, MLVU Benchmark |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for the VideoChat2 model on open-ended video understanding tasks from the MLVU benchmark.
Description
This implementation provides an evaluation pipeline for the VideoChat2 model on MLVU benchmark's open-ended tasks, specifically for subplot and summary generation. Unlike the choice-based evaluation, this script evaluates the model's ability to generate free-form text responses to questions about video content. It uses the same VideoChat2 architecture with LoRA adaptations but focuses on two specific tasks: subplot (sub-scene) understanding and video summarization. The script processes long videos, extracts frames, and generates detailed textual responses using the VideoChat2 model with extended generation capabilities (up to 1000 tokens).
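The frame-extraction step described above typically samples a fixed number of frames uniformly across the whole video before passing them to the model. A minimal, hypothetical sketch of that kind of uniform temporal sampling (the function name and exact formula are illustrative, not taken from the source file):

```python
def get_frame_indices(num_frames: int, num_segments: int = 16) -> list:
    """Pick one frame index from the middle of each of num_segments
    equal-length spans of the video (uniform temporal sampling).
    Illustrative sketch; the source file's sampling code may differ."""
    seg_size = num_frames / num_segments
    # centre of each segment, clamped to valid frame indices
    return [min(num_frames - 1, int(seg_size * (i + 0.5)))
            for i in range(num_segments)]

# Example: a 100-frame video sampled into 8 segments
print(get_frame_indices(100, 8))  # → [6, 18, 31, 43, 56, 68, 81, 93]
```

The sampled indices would then be used to read frames (e.g. via decord's `VideoReader`) and resize them to the target resolution.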
Usage
Use this script to evaluate VideoChat2 models on MLVU benchmark tasks that require generating descriptive text responses about video content, particularly for subplot identification and video summarization tasks.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/evaluation/models/videochat2/open_bench.py
- Lines: 1-473
Key Components
class MLVU(Dataset):
    def __init__(self, data_dir, data_list, num_segments=8, resolution=224):
        # Dataset initialization for open-ended tasks

    def qa_template(self, data):
        # Simple question-answer template without multiple choices
        question = f"{data['question']}"
        answer = data['answer']
        return question, answer

def infer_mvbench(data_sample, system="", question_prompt='',
                  answer_prompt=None, return_prompt='',
                  system_q=False, print_res=False, system_llm=False):
    # Inference with extended max_new_tokens for detailed responses
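The prompt that `infer_mvbench` feeds to the model is assembled from its `system`, `question_prompt`, and `answer_prompt` arguments around the dataset's question text. The exact template lives in the source file; the sketch below is a hypothetical illustration of that kind of concatenation (the separators are assumptions, not the actual template):

```python
def build_prompt(question, system="", question_prompt="", answer_prompt=None):
    """Assemble a single text prompt for open-ended generation.
    Hypothetical sketch: the separators and ordering are assumed,
    not copied from the source file's template."""
    prompt = system + question + question_prompt
    if answer_prompt:
        # an answer prefix such as "Answer:" can steer the model's output
        prompt += "\n" + answer_prompt
    return prompt

print(build_prompt(
    "What happens in the sub-scene?",
    system="Carefully watch this video and pay attention to every detail.\n",
))
```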
Import
# Model components
from models import VideoChat2_it_vicuna
from utils.config import Config
from utils.easydict import EasyDict
# Video processing
from decord import VideoReader, cpu
from dataset.video_transforms import (
    GroupNormalize, GroupScale, GroupCenterCrop,
    Stack, ToTorchFormatTensor
)
# Deep learning
import torch
from transformers import StoppingCriteria, StoppingCriteriaList
from peft import get_peft_model, LoraConfig, TaskType
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Directory containing JSON annotation files for subplot and summary tasks |
| data_list | dict | Yes | Mapping with subPlot and summary task configurations |
| num_segments | int | No | Number of frames to sample (default: 8 in the class signature; the usage example passes 16) |
| resolution | int | No | Target resolution for frames (default: 224) |
| model checkpoint | str | Yes | Path to videochat2_7b_stage3.pth checkpoint |
Outputs
| Name | Type | Description |
|---|---|---|
| subplot_all.json | JSON file | Contains subplot predictions with video names, questions, answers, and predictions |
| summary_all.json | JSON file | Contains summary predictions with video names, questions, answers, and predictions |
| Console output | text | Prediction outputs during inference |
Usage Examples
# Data configuration for open-ended tasks
data_list = {
    "subPlot": ("8_sub_scene.json", "/MLVU_all/video/subPlot", "video"),
    "summary": ("9_summary.json", "/MLVU_all/video/summary", "video")
}
# Initialize dataset
dataset = MLVU(data_dir="/MLVU_all/json", data_list=data_list,
               num_segments=16, resolution=224)
# Run inference with extended generation
for example in dataset:
    task_type = example['task_type']
    pred = infer_mvbench(
        example,
        system="Carefully watch this video and pay attention to every detail. Based on your observations, answer the given questions.\n",
        question_prompt="",
        answer_prompt="",
        max_new_tokens=1000  # Allow longer responses
    )
    # Store results by task type
    if task_type == "subPlot":
        result = {
            "video_name": example['video_path'].split("/")[-1],
            'Q': example['question'],
            'A': example['answer'],
            'pred': pred
        }