
Implementation:FlagOpen FlagEmbedding VideoLLaVA Open Bench

From Leeroopedia


Knowledge Sources
Domains Video Understanding, Open-ended Evaluation, MLVU Benchmark
Last Updated 2026-02-09 00:00 GMT

Overview

An evaluation script for the Video-LLaVA model on open-ended video understanding tasks from the MLVU benchmark.

Description

This implementation provides an evaluation pipeline for the Video-LLaVA-7B model on the MLVU benchmark's open-ended tasks, specifically subplot identification and video summarization. The script uses the LanguageBind Video-LLaVA architecture with 4-bit quantization to generate free-form textual responses about video content. Unlike the choice-based evaluation, this version lets the model generate detailed descriptions and summaries without being constrained to predefined options. The implementation processes videos through the Video-LLaVA processor, tokenizes the prompt with image tokens, and generates extended responses (up to 1024 new tokens) for the subplot and summary tasks.

Usage

Use this script to evaluate Video-LLaVA models on MLVU benchmark tasks requiring descriptive text generation, particularly for understanding sub-scenes and creating video summaries.

Code Reference

Source Location

Key Components

class MLVU(Dataset):
    def __init__(self, data_dir, data_list):
        # Dataset for open-ended Video-LLaVA evaluation
        ...

    def qa_template(self, data):
        # Simple template for open-ended questions
        question = f"{data['question']}"
        answer = data['answer']
        return question, answer
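As a quick check of the template logic, `qa_template` passes the question through unchanged and pairs it with the stored answer. A stand-alone sketch (the sample record below is illustrative, not taken from the benchmark data):

```python
# Stand-alone version of qa_template, free of the Dataset dependency;
# the sample record is hypothetical but uses the same fields as above.
def qa_template(data):
    question = f"{data['question']}"
    answer = data['answer']
    return question, answer

sample = {
    "question": "Please summarize the main plot of this video.",
    "answer": "A detective investigates a series of thefts.",
}
q, a = qa_template(sample)
print(q)  # question is passed through unchanged
print(a)
```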

# Main evaluation loop (excerpt)
for example in dataset:
    video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
    conv.append_message(conv.roles[0], inp)   # inp: image tokens + question
    conv.append_message(conv.roles[1], None)  # open-ended response slot
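The `inp` string that seeds the conversation interleaves one image token per sampled frame with the question. A stand-alone sketch of that assembly (the `"<image>"` literal and `num_frames = 8` are assumptions standing in for `DEFAULT_IMAGE_TOKEN` and `model.get_video_tower().config.num_frames`):

```python
# Sketch of the prompt assembly; "<image>" and num_frames=8 are
# assumptions in place of DEFAULT_IMAGE_TOKEN and the video tower config.
DEFAULT_IMAGE_TOKEN = "<image>"
num_frames = 8

question = "Please summarize the sub-scenes of this video in detail."
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * num_frames) + '\n' + question

print(inp.count(DEFAULT_IMAGE_TOKEN))  # one token per sampled frame
```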

Import

# Video-LLaVA components
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

# Standard libraries
import torch
import json
from tqdm import tqdm
import os

I/O Contract

Inputs

Name Type Required Description
data_dir str Yes Directory with subplot and summary JSON files
data_list dict Yes Mapping of subPlot and summary task configurations
model_path str Yes Path to Video-LLaVA model (LanguageBind/Video-LLaVA-7B)
device str Yes CUDA device (e.g., 'cuda:6')
load_4bit bool No Enable 4-bit quantization (default: True)

Outputs

Name Type Description
subplot_all.json JSON file Subplot predictions with video names, questions, answers, and predictions
summary_all.json JSON file Summary predictions with video names, questions, answers, and predictions
Console output text Ground truth and predictions during inference
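The shape of those outputs can be sketched with the model call stubbed out, so the record-building and JSON-saving steps stand on their own. `run_inference` is a placeholder for the Video-LLaVA generate step, and the exact field names are assumptions matching the columns above:

```python
import json
import os
import tempfile

# Sketch of the evaluation loop with the model call stubbed out.
# run_inference stands in for the Video-LLaVA generate step; the record
# fields mirror the subplot_all.json / summary_all.json schema above.
def run_inference(video_path, question):
    return "stub prediction"  # stand-in for the decoded model output

def evaluate(dataset, save_path):
    results = []
    for example in dataset:
        pred = run_inference(example["video_path"], example["question"])
        # Console output during inference, as described above
        print("GT:", example["answer"], "| Pred:", pred)
        results.append({
            "video_name": example["video_path"],
            "question": example["question"],
            "answer": example["answer"],
            "pred": pred,
        })
    with open(save_path, "w") as f:
        json.dump(results, f, indent=4)
    return results

demo = [{"video_path": "clip_001.mp4",
         "question": "What happens in this sub-scene?",
         "answer": "A chase through the market."}]
evaluate(demo, os.path.join(tempfile.gettempdir(), "subplot_all_demo.json"))
```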

Usage Examples

# Initialize Video-LLaVA model
disable_torch_init()

data_list = {
    "subPlot": ("8_sub_scene.json", "/LVBench_all/video/subPlot", "video"),
    "summary": ("9_summary.json", "/LVBench_all/video/summary", "video")
}

model_path = 'LanguageBind/Video-LLaVA-7B'
device = 'cuda:6'
load_4bit = True

tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path),
    False, load_4bit, device=device  # load_8bit=False, 4-bit quantization on
)
video_processor = processor['video']

# Process video for open-ended generation
conv_mode = "llava_v1"
conv = conv_templates[conv_mode].copy()

video_tensor = video_processor(video_path, return_tensors='pt')['pixel_values']
tensor = video_tensor.to(model.device, dtype=torch.float16)

# Prepare input with image tokens
inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + question
conv.system = "Carefully watch this video and pay attention to every detail. Based on your observations, answer the given questions."
conv.append_message(conv.roles[0], inp)
conv.append_message(conv.roles[1], None)  # Open-ended

# Tokenize the full prompt, expanding image tokens to IMAGE_TOKEN_INDEX
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt'
).unsqueeze(0).to(model.device)

# Generate response
output_ids = model.generate(
    input_ids,
    images=tensor,
    do_sample=True,
    temperature=0.1,
    max_new_tokens=1024
)

pred = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip().replace("</s>", "")
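The raw decoded string can still carry the EOS marker and padding whitespace. A small helper of the kind used inline above (the name is hypothetical; it applies `replace` before `strip` so whitespace left behind by the removed marker is also trimmed):

```python
# Hypothetical helper mirroring the inline cleanup above: drop the
# sentence-end marker, then strip surrounding whitespace.
def clean_prediction(text: str) -> str:
    return text.replace("</s>", "").strip()

print(clean_prediction("  The video shows a heist.</s> "))
```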

Related Pages
