Implementation: FlagOpen FlagEmbedding MLVU Needle Data
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Benchmark Data, Information Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Benchmark dataset testing needle-in-haystack information retrieval from long videos in the MLVU evaluation framework.
Description
The MLVU Needle dataset challenges models to find specific information ("needles") within long video content ("haystacks"). Stored as 4,617 JSON lines (one record per line), the dataset contains multiple-choice questions about specific events, actions, or details that appear at various points in lengthy videos. The task requires models to locate and comprehend precise moments or information embedded within extended video sequences, analogous to needle-in-haystack retrieval in text, but in the video domain.
Questions test the model's ability to:
- Locate specific events in long videos
- Identify particular actions or objects
- Extract precise information from video context
- Handle temporal reasoning across extended sequences
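These abilities are typically probed by posing each record as a multiple-choice question. The prompt builder below is a hypothetical sketch of how a record's `question` and `candidates` fields could be formatted for a model; the prompt wording is an assumption, not defined by MLVU.

```python
# Hypothetical helper: format one needle record as a multiple-choice prompt.
# The field names (question, candidates) follow the dataset schema; the
# surrounding prompt text is illustrative only.
def build_needle_prompt(record: dict) -> str:
    letters = "ABCD"
    options = "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(record["candidates"])
    )
    return (
        "Watch the video and answer the question.\n"
        f"Question: {record['question']}\n"
        f"Options:\n{options}\n"
        "Answer with the letter of the correct option."
    )

# Example record values taken from the sample entry shown later in this card.
record = {
    "question": "What does the hand coming out of the computer do?",
    "candidates": [
        "Delivers a product",
        "Shakes the woman's hand",
        "Takes the woman's credit card",
        "Points at something on the screen",
    ],
}
prompt = build_needle_prompt(record)
```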
Usage
Use this dataset for evaluating video retrieval capabilities, testing long-context video understanding, or benchmarking models on fine-grained information extraction from videos.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/data/2_needle.json
Data Structure
```python
{
    "video": str,             # Video filename
    "duration": float,        # Video duration in seconds (decimal)
    "question": str,          # Specific question about video content
    "candidates": List[str],  # Four candidate answers
    "answer": str,            # Correct answer
    "question_type": str      # Always "findNeedle"
}
```
Import
```python
import json

# Load needle dataset (JSON Lines: one record per line)
with open("research/MLVU/data/2_needle.json", "r") as f:
    needle_data = [json.loads(line) for line in f]
```
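If a copy of the file were stored as a single JSON array rather than JSON Lines, line-wise parsing would fail. The tolerant loader below handles both layouts; it is a defensive sketch, not behavior guaranteed by the repository (the demo file here is synthetic).

```python
import json
import os
import tempfile
from typing import Dict, List

def load_json_or_jsonl(path: str) -> List[Dict]:
    """Load either a JSON array file or a JSON Lines file."""
    with open(path, "r") as f:
        text = f.read()
    try:
        obj = json.loads(text)  # whole file is one valid JSON document
        return obj if isinstance(obj, list) else [obj]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one record per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]

# Demo on a temporary JSON Lines file with synthetic records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"video": "a.mp4"}\n{"video": "b.mp4"}\n')
    path = tmp.name
records = load_json_or_jsonl(path)
os.unlink(path)
```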
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the needle dataset JSON file |
Outputs
| Field | Type | Description |
|---|---|---|
| video | str | Video filename |
| duration | float | Video length in seconds (with decimals) |
| question | str | Question requiring needle finding |
| candidates | List[str] | Four possible answers |
| answer | str | Correct answer |
| question_type | str | Type identifier ("findNeedle") |
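For downstream code, the output fields can be mirrored as a typed record. The dataclass below is an illustrative mapping of this contract, not an API shipped with the repository; the sample values come from the example entry in the next section.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NeedleRecord:
    """Typed view of one needle entry (fields mirror the table above)."""
    video: str             # Video filename
    duration: float        # Video length in seconds
    question: str          # Question requiring needle finding
    candidates: List[str]  # Four possible answers
    answer: str            # Correct answer
    question_type: str     # "findNeedle"

    @classmethod
    def from_json(cls, obj: dict) -> "NeedleRecord":
        return cls(**obj)

rec = NeedleRecord.from_json({
    "video": "needle_32.mp4",
    "duration": 467.98,
    "question": "What does the hand coming out of the computer do?",
    "candidates": ["Delivers a product", "Shakes the woman's hand",
                   "Takes the woman's credit card",
                   "Points at something on the screen"],
    "answer": "Delivers a product",
    "question_type": "findNeedle",
})
```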
Usage Examples
```python
import json
from typing import Dict, List

# Load and analyze needle dataset
def load_needle_data(file_path: str) -> List[Dict]:
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f]

data = load_needle_data("research/MLVU/data/2_needle.json")

# Example entry
example = data[0]
print(f"Video: {example['video']}")
print(f"Duration: {example['duration']:.2f}s")
print(f"Question: {example['question']}")
print(f"Candidates: {example['candidates']}")
print(f"Answer: {example['answer']}")
# Output:
# Video: needle_32.mp4
# Duration: 467.98s
# Question: What does the hand coming out of the computer do?
# Candidates: ['Delivers a product', "Shakes the woman's hand",
#              "Takes the woman's credit card", 'Points at something on the screen']
# Answer: Delivers a product

# Evaluate needle-finding capability, bucketed by video duration
def evaluate_needle_finding(model, data: List[Dict]) -> Dict[str, float]:
    results = {
        "short_video_acc": 0.0,   # < 5 min
        "medium_video_acc": 0.0,  # 5-15 min
        "long_video_acc": 0.0,    # > 15 min
        "overall_acc": 0.0,
    }
    short_correct, short_total = 0, 0
    medium_correct, medium_total = 0, 0
    long_correct, long_total = 0, 0

    for item in data:
        duration = item['duration']
        video_path = f"videos/{item['video']}"

        # Predict answer
        prediction = model.find_needle(
            video_path,
            item['question'],
            item['candidates']
        )
        is_correct = (prediction == item['answer'])

        # Categorize by duration
        if duration < 300:
            short_total += 1
            short_correct += is_correct
        elif duration < 900:
            medium_total += 1
            medium_correct += is_correct
        else:
            long_total += 1
            long_correct += is_correct

    results["short_video_acc"] = short_correct / short_total if short_total > 0 else 0
    results["medium_video_acc"] = medium_correct / medium_total if medium_total > 0 else 0
    results["long_video_acc"] = long_correct / long_total if long_total > 0 else 0
    results["overall_acc"] = (short_correct + medium_correct + long_correct) / len(data)
    return results

# Statistics
durations = [item['duration'] for item in data]
print(f"Average duration: {sum(durations)/len(durations):.2f}s")
print(f"Max duration: {max(durations):.2f}s")
print(f"Min duration: {min(durations):.2f}s")

# Sample by difficulty (longer videos are harder)
easy_samples = sorted(data, key=lambda x: x['duration'])[:100]
hard_samples = sorted(data, key=lambda x: x['duration'], reverse=True)[:100]
```
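For context when reading accuracy numbers: with four candidates per question, a uniform random guesser achieves 25% expected accuracy. The sketch below simulates that floor; the records here are synthetic placeholders, not real dataset entries.

```python
import random

def random_baseline_accuracy(data, seed=0):
    """Expected accuracy of guessing uniformly among each item's candidates."""
    rng = random.Random(seed)
    correct = sum(
        rng.choice(item["candidates"]) == item["answer"] for item in data
    )
    return correct / len(data)

# Synthetic stand-in records with 4 candidates each (not real MLVU entries).
synthetic = [
    {"candidates": ["a", "b", "c", "d"], "answer": "a"} for _ in range(10000)
]
acc = random_baseline_accuracy(synthetic)  # ~= 0.25 over many trials
```

Reporting model accuracy alongside this 25% floor makes it clear how much of the score reflects genuine needle finding rather than chance.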