Implementation: FlagOpen FlagEmbedding MLVU Needle Data
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Benchmark Data, Information Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Benchmark dataset testing needle-in-haystack information retrieval from long videos in the MLVU evaluation framework.
Description
The MLVU Needle dataset challenges models to find specific information ("needles") within long video content ("haystacks"). Stored as 4,617 JSON lines (one record per line), the dataset contains multiple-choice questions about specific events, actions, or details that appear at various points in lengthy videos. The task requires models to locate and comprehend precise moments or information embedded within extended video sequences, analogous to needle-in-haystack retrieval in text, but in the video domain.
Questions test the model's ability to:
- Locate specific events in long videos
- Identify particular actions or objects
- Extract precise information from video context
- Handle temporal reasoning across extended sequences
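These abilities are typically probed by posing each record as a multiple-choice question. The prompt builder below is a hypothetical sketch of how a record's `question` and `candidates` fields could be formatted for a model; the prompt wording is an assumption, not defined by MLVU.

```python
# Hypothetical helper: format one needle record as a multiple-choice prompt.
# The field names (question, candidates) follow the dataset schema; the
# surrounding prompt text is illustrative only.
def build_needle_prompt(record: dict) -> str:
    letters = "ABCD"
    options = "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(record["candidates"])
    )
    return (
        "Watch the video and answer the question.\n"
        f"Question: {record['question']}\n"
        f"Options:\n{options}\n"
        "Answer with the letter of the correct option."
    )

# Example record values taken from the sample entry shown later in this card.
record = {
    "question": "What does the hand coming out of the computer do?",
    "candidates": [
        "Delivers a product",
        "Shakes the woman's hand",
        "Takes the woman's credit card",
        "Points at something on the screen",
    ],
}
prompt = build_needle_prompt(record)
```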
Usage
Use this dataset for evaluating video retrieval capabilities, testing long-context video understanding, or benchmarking models on fine-grained information extraction from videos.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/data/2_needle.json
Data Structure
```python
{
    "video": str,             # Video filename
    "duration": float,        # Video duration in seconds (decimal)
    "question": str,          # Specific question about video content
    "candidates": List[str],  # Four candidate answers
    "answer": str,            # Correct answer
    "question_type": str      # Always "findNeedle"
}
```
Import
```python
import json

# Load needle dataset (JSON Lines: one record per line)
with open("research/MLVU/data/2_needle.json", "r") as f:
    needle_data = [json.loads(line) for line in f]
```
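If a copy of the file were stored as a single JSON array rather than JSON Lines, line-wise parsing would fail. The tolerant loader below handles both layouts; it is a defensive sketch, not behavior guaranteed by the repository (the demo file here is synthetic).

```python
import json
import os
import tempfile
from typing import Dict, List

def load_json_or_jsonl(path: str) -> List[Dict]:
    """Load either a JSON array file or a JSON Lines file."""
    with open(path, "r") as f:
        text = f.read()
    try:
        obj = json.loads(text)  # whole file is one valid JSON document
        return obj if isinstance(obj, list) else [obj]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one record per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]

# Demo on a temporary JSON Lines file with synthetic records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"video": "a.mp4"}\n{"video": "b.mp4"}\n')
    path = tmp.name
records = load_json_or_jsonl(path)
os.unlink(path)
```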
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the needle dataset JSON file |
Outputs
| Field | Type | Description |
|---|---|---|
| video | str | Video filename |
| duration | float | Video length in seconds (with decimals) |
| question | str | Question requiring needle finding |
| candidates | List[str] | Four possible answers |
| answer | str | Correct answer |
| question_type | str | Type identifier ("findNeedle") |
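For downstream code, the output fields can be mirrored as a typed record. The dataclass below is an illustrative mapping of this contract, not an API shipped with the repository; the sample values come from the example entry in the next section.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NeedleRecord:
    """Typed view of one needle entry (fields mirror the table above)."""
    video: str             # Video filename
    duration: float        # Video length in seconds
    question: str          # Question requiring needle finding
    candidates: List[str]  # Four possible answers
    answer: str            # Correct answer
    question_type: str     # "findNeedle"

    @classmethod
    def from_json(cls, obj: dict) -> "NeedleRecord":
        return cls(**obj)

rec = NeedleRecord.from_json({
    "video": "needle_32.mp4",
    "duration": 467.98,
    "question": "What does the hand coming out of the computer do?",
    "candidates": ["Delivers a product", "Shakes the woman's hand",
                   "Takes the woman's credit card",
                   "Points at something on the screen"],
    "answer": "Delivers a product",
    "question_type": "findNeedle",
})
```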
Usage Examples
```python
import json
from typing import Dict, List

# Load and analyze needle dataset
def load_needle_data(file_path: str) -> List[Dict]:
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f]

data = load_needle_data("research/MLVU/data/2_needle.json")

# Example entry
example = data[0]
print(f"Video: {example['video']}")
print(f"Duration: {example['duration']:.2f}s")
print(f"Question: {example['question']}")
print(f"Candidates: {example['candidates']}")
print(f"Answer: {example['answer']}")
# Output:
# Video: needle_32.mp4
# Duration: 467.98s
# Question: What does the hand coming out of the computer do?
# Candidates: ['Delivers a product', "Shakes the woman's hand",
#              "Takes the woman's credit card", 'Points at something on the screen']
# Answer: Delivers a product

# Evaluate needle-finding capability, bucketed by video duration
def evaluate_needle_finding(model, data: List[Dict]) -> Dict[str, float]:
    results = {
        "short_video_acc": 0.0,   # < 5 min
        "medium_video_acc": 0.0,  # 5-15 min
        "long_video_acc": 0.0,    # > 15 min
        "overall_acc": 0.0,
    }
    short_correct, short_total = 0, 0
    medium_correct, medium_total = 0, 0
    long_correct, long_total = 0, 0

    for item in data:
        duration = item['duration']
        video_path = f"videos/{item['video']}"

        # Predict answer
        prediction = model.find_needle(
            video_path,
            item['question'],
            item['candidates']
        )
        is_correct = (prediction == item['answer'])

        # Categorize by duration
        if duration < 300:
            short_total += 1
            short_correct += is_correct
        elif duration < 900:
            medium_total += 1
            medium_correct += is_correct
        else:
            long_total += 1
            long_correct += is_correct

    results["short_video_acc"] = short_correct / short_total if short_total > 0 else 0
    results["medium_video_acc"] = medium_correct / medium_total if medium_total > 0 else 0
    results["long_video_acc"] = long_correct / long_total if long_total > 0 else 0
    results["overall_acc"] = (short_correct + medium_correct + long_correct) / len(data)
    return results

# Statistics
durations = [item['duration'] for item in data]
print(f"Average duration: {sum(durations)/len(durations):.2f}s")
print(f"Max duration: {max(durations):.2f}s")
print(f"Min duration: {min(durations):.2f}s")

# Sample by difficulty (longer videos are harder)
easy_samples = sorted(data, key=lambda x: x['duration'])[:100]
hard_samples = sorted(data, key=lambda x: x['duration'], reverse=True)[:100]
```
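For context when reading accuracy numbers: with four candidates per question, a uniform random guesser achieves 25% expected accuracy. The sketch below simulates that floor; the records here are synthetic placeholders, not real dataset entries.

```python
import random

def random_baseline_accuracy(data, seed=0):
    """Expected accuracy of guessing uniformly among each item's candidates."""
    rng = random.Random(seed)
    correct = sum(
        rng.choice(item["candidates"]) == item["answer"] for item in data
    )
    return correct / len(data)

# Synthetic stand-in records with 4 candidates each (not real MLVU entries).
synthetic = [
    {"candidates": ["a", "b", "c", "d"], "answer": "a"} for _ in range(10000)
]
acc = random_baseline_accuracy(synthetic)  # ~= 0.25 over many trials
```

Reporting model accuracy alongside this 25% floor makes it clear how much of the score reflects genuine needle finding rather than chance.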