Principle:FlagOpen FlagEmbedding Long Video Understanding Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Computer Vision, Video Understanding, Multimodal LLMs |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A comprehensive evaluation framework for assessing multi-modal large language models on long video understanding tasks across holistic comprehension, single-detail extraction, and multi-detail reasoning.
Description
This principle introduces the MLVU (Multi-modal Long Video Understanding) benchmark, a systematic evaluation framework designed to test video-language models on extended video content. The benchmark covers diverse task types including plot question-answering, anomaly detection, needle-in-a-haystack retrieval, topic reasoning, action ordering, counting, egocentric reasoning, summarization, and sub-scene captioning. Unlike short-video benchmarks, MLVU emphasizes temporal understanding over long durations (minutes to hours), testing models' ability to maintain context and extract relevant information from extensive visual content. The framework supports both multiple-choice (closed-ended) and open-ended generation evaluation formats.
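To make the two evaluation formats concrete, here is a minimal sketch of what one closed-ended (multiple-choice) benchmark item might look like. The field names and values are illustrative placeholders, not MLVU's actual data schema; an open-ended item would replace the candidate list with a free-form reference answer.

```python
# Hypothetical shape of one closed-ended MLVU-style item.
# Field names ("video", "task", "candidates", ...) are assumptions
# for illustration, not the benchmark's real schema.
sample_item = {
    "video": "videos/movie_001.mp4",
    "task": "plotQA",
    "question": "Why does the protagonist leave the city?",
    "candidates": [
        "A. To find a lost relative",
        "B. To escape a crime",
        "C. To start a new job",
        "D. To recover from an illness",
    ],
    "answer": "A",  # ground-truth option letter
}
```

Closed-ended items are scored by matching the predicted option letter; open-ended items require generation metrics such as ROUGE-L or BERTScore.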
Usage
Use this principle when:
- Evaluating video-language models on long-form content understanding
- Benchmarking multi-modal LLMs for temporal reasoning capabilities
- Assessing models' ability to extract fine-grained details from extended videos
- Comparing different video encoding and frame sampling strategies
Theoretical Basis
The evaluation framework categorizes tasks into three dimensions:
- Holistic Understanding: tasks requiring global video comprehension
  - Topic reasoning: identify the video's overall theme
  - Summarization: generate a comprehensive video summary
  - Metrics: ROUGE-L and BERTScore for open-ended generation; accuracy for multiple choice
- Single-Detail Extraction: tasks focusing on specific moments
  - Anomaly detection: identify unusual events
  - Needle-in-a-haystack: locate a specific piece of information
  - Sub-scene captioning: describe a particular segment
  - Metrics: exact match; semantic similarity
- Multi-Detail Reasoning: tasks requiring aggregation across multiple moments
  - Action ordering: sequence temporal events
  - Counting: enumerate occurrences
  - Egocentric reasoning: first-person perspective understanding
  - Metrics: temporal IoU; counting accuracy
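The taxonomy above can be summarized as a lookup table. This is an illustrative sketch: the (dimension, metric) pairings follow the text, but the key names are shorthand for this example, not identifiers from the benchmark code, and where the text lists several metrics per dimension the per-task assignment is an assumption.

```python
# Illustrative mapping of MLVU task types to (dimension, typical metric).
# Key names and per-task metric choices are assumptions for this sketch.
TASK_TAXONOMY = {
    "topic_reasoning":      ("holistic",      "accuracy"),
    "summarization":        ("holistic",      "ROUGE-L / BERTScore"),
    "anomaly_detection":    ("single_detail", "exact match"),
    "needle_in_haystack":   ("single_detail", "exact match"),
    "sub_scene_captioning": ("single_detail", "semantic similarity"),
    "action_ordering":      ("multi_detail",  "temporal IoU"),
    "counting":             ("multi_detail",  "counting accuracy"),
    "egocentric_reasoning": ("multi_detail",  "accuracy"),
}

def dimension_of(task: str) -> str:
    """Return the evaluation dimension a task belongs to."""
    return TASK_TAXONOMY[task][0]
```

Such a table is convenient for routing each benchmark item to the right scorer during evaluation.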
Evaluation protocol:
1. Sample frames uniformly from video V: F = {f_1, ..., f_n}
2. Generate a response R = Model(F, Q), where Q is the question
3. Compare against the ground truth GT via Score(R, GT), using the task-specific metric
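The protocol above can be sketched in a few lines of Python. The uniform-sampling arithmetic is standard; the scoring function shown covers only the closed-ended (option-letter) case, and its matching rule is an assumption of this sketch rather than MLVU's exact parsing logic.

```python
# Sketch of the evaluation protocol: uniform frame sampling, then
# task-specific scoring. Model inference itself is out of scope here.

def uniform_sample_indices(num_total_frames: int, n: int) -> list[int]:
    """Pick n frame indices spread uniformly over the video.

    Each index is taken from the center of one of n equal bins,
    so coverage spans the whole duration.
    """
    if n >= num_total_frames:
        return list(range(num_total_frames))
    step = num_total_frames / n
    return [int(step * i + step / 2) for i in range(n)]

def score_choice(response: str, ground_truth: str) -> float:
    """Closed-ended scoring: does the response start with the
    ground-truth option letter (case-insensitive)?"""
    return 1.0 if response.strip().upper().startswith(
        ground_truth.strip().upper()) else 0.0

# Example: sample 8 frames from a 240-frame video.
indices = uniform_sample_indices(240, 8)
```

Open-ended tasks would swap `score_choice` for a generation metric (ROUGE-L, BERTScore), and multi-detail tasks for temporal IoU or counting accuracy, per the taxonomy above.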
Related Pages
- Implementation:FlagOpen_FlagEmbedding_MLVU_PlotQA_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Anomaly_Reco_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Needle_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Topic_Reasoning_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Order_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Count_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Ego_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Sub_Scene_Data
- Implementation:FlagOpen_FlagEmbedding_MLVU_Summary_Data
- Implementation:FlagOpen_FlagEmbedding_VideoChat2_Choice_Bench
- Implementation:FlagOpen_FlagEmbedding_VideoChat2_Open_Bench
- Implementation:FlagOpen_FlagEmbedding_VideoLLaVA_Choice_Bench
- Implementation:FlagOpen_FlagEmbedding_VideoLLaVA_Open_Bench
- Implementation:FlagOpen_FlagEmbedding_MLVU_Evaluate_SSC
- Implementation:FlagOpen_FlagEmbedding_MLVU_Evaluate_Summary
- Implementation:FlagOpen_FlagEmbedding_MLVU_Open_Bench
- Implementation:FlagOpen_FlagEmbedding_MLVU_Choice_Bench