
Implementation:FlagOpen FlagEmbedding VideoChat2 Open Bench

From Leeroopedia


Knowledge Sources
Domains Video Understanding, Open-ended Evaluation, MLVU Benchmark
Last Updated 2026-02-09 00:00 GMT

Overview

An evaluation script for the VideoChat2 model on open-ended video understanding tasks from the MLVU benchmark.

Description

This implementation provides an evaluation pipeline for the VideoChat2 model on the MLVU benchmark's open-ended tasks, specifically subplot and summary generation. Unlike the choice-based evaluation, this script evaluates the model's ability to generate free-form text responses to questions about video content. It uses the same VideoChat2 architecture with LoRA adaptations but focuses on two tasks: subplot (sub-scene) understanding and video summarization. The script processes long videos, extracts frames, and generates detailed textual responses using the VideoChat2 model with extended generation capabilities (up to 1000 new tokens).
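The frame-extraction step usually samples a fixed number of indices uniformly across the video so that long videos are covered end to end. The helper below is a hedged sketch of that midpoint-per-segment scheme; `get_frame_indices` is a hypothetical name, not the script's actual function:

```python
def get_frame_indices(num_frames: int, num_segments: int) -> list:
    """Pick one frame index from each of `num_segments` equal-length
    temporal segments (the midpoint of each segment), so the samples
    span the whole video regardless of its length."""
    seg_size = num_frames / num_segments
    return [int(seg_size * (i + 0.5)) for i in range(num_segments)]
```

With `num_segments=16` (the value used in this evaluation), a 1,600-frame video would yield indices 50, 150, ..., 1550; the sampled frames are then decoded, resized, and stacked into the model input.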

Usage

Use this script to evaluate VideoChat2 models on MLVU benchmark tasks that require generating descriptive text responses about video content, particularly for subplot identification and video summarization tasks.

Code Reference

Source Location

Key Components

class MLVU(Dataset):
    def __init__(self, data_dir, data_list, num_segments=8, resolution=224):
        # Dataset initialization for open-ended tasks

    def qa_template(self, data):
        # Simple question-answer template without multiple choices
        question = f"{data['question']}"
        answer = data['answer']
        return question, answer

def infer_mvbench(data_sample, system="", question_prompt='',
                  answer_prompt=None, return_prompt='',
                  system_q=False, print_res=False, system_llm=False):
    # Inference with extended max_new_tokens for detailed responses
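The `__init__` stub above conceptually flattens the per-task JSON annotation files into one list of examples. This is a hedged sketch with a hypothetical helper name (`load_annotations`), following the `data_list` tuple layout shown under Usage Examples; the real dataset class may store fields differently:

```python
import json
import os

def load_annotations(data_dir: str, data_list: dict) -> list:
    """Read each task's annotation JSON from data_dir and flatten
    all entries into one list, tagging each entry with its task
    type and the video directory prefix for that task."""
    entries = []
    for task, (json_name, video_prefix, media_type) in data_list.items():
        with open(os.path.join(data_dir, json_name)) as f:
            for item in json.load(f):
                entries.append({
                    "task_type": task,
                    "prefix": video_prefix,
                    "data": item,
                })
    return entries
```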

Import

# Model components
from models import VideoChat2_it_vicuna
from utils.config import Config
from utils.easydict import EasyDict

# Video processing
from decord import VideoReader, cpu
from dataset.video_transforms import (
    GroupNormalize, GroupScale, GroupCenterCrop,
    Stack, ToTorchFormatTensor
)

# Deep learning
import torch
from transformers import StoppingCriteria, StoppingCriteriaList
from peft import get_peft_model, LoraConfig, TaskType

I/O Contract

Inputs

Name Type Required Description
data_dir str Yes Directory containing JSON annotation files for subplot and summary tasks
data_list dict Yes Mapping with subPlot and summary task configurations
num_segments int No Number of frames to sample (default 8 in the class signature; this evaluation passes 16)
resolution int No Target resolution for frames (default: 224)
model checkpoint str Yes Path to videochat2_7b_stage3.pth checkpoint

Outputs

Name Type Description
subplot_all.json JSON file Contains subplot predictions with video names, questions, answers, and predictions
summary_all.json JSON file Contains summary predictions with video names, questions, answers, and predictions
Console output text Prediction outputs during inference
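Producing the two output files amounts to grouping result records by task and dumping each group to JSON. The sketch below assumes a hypothetical `save_results` helper; the real script's file layout and indentation may differ:

```python
import json
import os

def save_results(results: dict, save_dir: str) -> None:
    """Write per-task prediction lists to the output files listed
    above: subPlot -> subplot_all.json, summary -> summary_all.json."""
    name_map = {"subPlot": "subplot_all.json", "summary": "summary_all.json"}
    os.makedirs(save_dir, exist_ok=True)
    for task, items in results.items():
        path = os.path.join(save_dir, name_map[task])
        with open(path, "w") as f:
            json.dump(items, f, indent=4)
```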

Usage Examples

# Data configuration for open-ended tasks
data_list = {
    "subPlot": ("8_sub_scene.json", "/MLVU_all/video/subPlot", "video"),
    "summary": ("9_summary.json", "/MLVU_all/video/summary", "video")
}

# Initialize dataset
dataset = MLVU(data_dir="/MLVU_all/json", data_list=data_list,
               num_segments=16, resolution=224)

# Run inference with extended generation
for example in dataset:
    task_type = example['task_type']
    pred = infer_mvbench(
        example,
        system="Carefully watch this video and pay attention to every detail. Based on your observations, answer the given questions.\n",
        question_prompt="",
        answer_prompt="",
        max_new_tokens=1000  # Allow longer responses
    )

    # Store results by task type
    if task_type == "subPlot":
        result = {
            "video_name": example['video_path'].split("/")[-1],
            'Q': example['question'],
            'A': example['answer'],
            'pred': pred
        }
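The per-example records above can be built with a small helper before being appended to a per-task list. `collect_result` is a hypothetical name, but the record shape mirrors the example; `os.path.basename` is equivalent to the `split("/")[-1]` idiom used there:

```python
import os

def collect_result(example: dict, pred: str) -> dict:
    """Build one result record in the shape written to the output
    JSON files: video basename, question, answer, and prediction."""
    return {
        "video_name": os.path.basename(example["video_path"]),
        "Q": example["question"],
        "A": example["answer"],
        "pred": pred,
    }
```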
