Implementation:FlagOpen FlagEmbedding MLVU Needle Data

From Leeroopedia


Knowledge Sources
Domains: Video Understanding, Benchmark Data, Information Retrieval
Last Updated: 2026-02-09 00:00 GMT

Overview

Benchmark dataset testing needle-in-haystack information retrieval from long videos in the MLVU evaluation framework.

Description

The MLVU Needle dataset challenges models to find specific information ("needles") within long video content ("haystacks"). The data file spans 4617 lines and contains multiple-choice questions about specific events, actions, or details that appear at various points in lengthy videos. To answer, a model must locate and comprehend a precise moment embedded in an extended video sequence, the video-domain analogue of needle-in-a-haystack text retrieval.

Questions test the model's ability to:

  • Locate specific events in long videos
  • Identify particular actions or objects
  • Extract precise information from video context
  • Handle temporal reasoning across extended sequences
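
Because every question is multiple choice, one common way to query a model is to render the candidates as lettered options. The following is a minimal sketch of that idea, not part of the MLVU code base: the function name `build_prompt`, the prompt wording, and the lettering scheme are all assumptions made for illustration, and it assumes the correct answer appears verbatim among the candidates.

```python
import string
from typing import Dict, Tuple


def build_prompt(entry: Dict) -> Tuple[str, str]:
    """Return (prompt_text, correct_letter) for one dataset entry.

    Hypothetical helper: the prompt wording is an illustrative choice.
    """
    letters = string.ascii_uppercase
    lines = [f"Question: {entry['question']}", "Options:"]
    # Pair each candidate with a letter: (A), (B), (C), (D)
    for letter, candidate in zip(letters, entry["candidates"]):
        lines.append(f"({letter}) {candidate}")
    lines.append("Answer with the option letter only.")
    # list.index raises ValueError if the answer is not among the candidates
    correct_letter = letters[entry["candidates"].index(entry["answer"])]
    return "\n".join(lines), correct_letter
```

Comparing option letters rather than full answer strings makes scoring less sensitive to small differences in how a model phrases its reply.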

Usage

Use this dataset for evaluating video retrieval capabilities, testing long-context video understanding, or benchmarking models on fine-grained information extraction from videos.

Code Reference

Source Location

Data Structure

{
    "video": str,              # Video filename
    "duration": float,         # Video duration in seconds (decimal)
    "question": str,           # Specific question about video content
    "candidates": List[str],   # Four candidate answers
    "answer": str,             # Correct answer
    "question_type": str       # Always "findNeedle"
}
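
The schema above can be checked mechanically before running an evaluation. The sketch below is one possible validator, written only from the field list on this page; `validate_entry` and `EXPECTED_FIELDS` are hypothetical names, and the checks (four candidates, answer present among them, fixed question type) follow the comments in the structure rather than any official specification.

```python
from typing import Any, Dict, List

# Field names and types taken from the data structure described on this page
EXPECTED_FIELDS = {
    "video": str,
    "duration": (int, float),  # accept ints in case a duration has no decimals
    "question": str,
    "candidates": list,
    "answer": str,
    "question_type": str,
}


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(f"wrong type for {name}: {type(entry[name]).__name__}")
    if problems:
        return problems  # skip content checks when the shape is already wrong
    if len(entry["candidates"]) != 4:
        problems.append(f"expected 4 candidates, got {len(entry['candidates'])}")
    if entry["answer"] not in entry["candidates"]:
        problems.append("answer is not one of the candidates")
    if entry["question_type"] != "findNeedle":
        problems.append(f"unexpected question_type: {entry['question_type']}")
    return problems
```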

Import

import json

# Load needle dataset (assumes the file holds a single JSON array, not one object per line)
with open("research/MLVU/data/2_needle.json", "r", encoding="utf-8") as f:
    needle_data = json.load(f)

I/O Contract

Inputs

Name       Type  Required  Description
file_path  str   Yes       Path to the needle dataset JSON file

Outputs

Field          Type       Description
video          str        Video filename
duration       float      Video length in seconds (with decimals)
question       str        Question requiring needle finding
candidates     List[str]  Four possible answers
answer         str        Correct answer
question_type  str        Type identifier ("findNeedle")

Usage Examples

import json
from typing import List, Dict

# Load and analyze needle dataset
def load_needle_data(file_path: str) -> List[Dict]:
    # Assumes the file holds a single JSON array of entries
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)

data = load_needle_data("research/MLVU/data/2_needle.json")

# Example entry
example = data[0]
print(f"Video: {example['video']}")
print(f"Duration: {example['duration']:.2f}s")
print(f"Question: {example['question']}")
print(f"Candidates: {example['candidates']}")
print(f"Answer: {example['answer']}")

# Output:
# Video: needle_32.mp4
# Duration: 467.98s
# Question: What does the hand coming out of the computer do?
# Candidates: ['Delivers a product', "Shakes the woman's hand",
#              "Takes the woman's credit card", 'Points at something on the screen']
# Answer: Delivers a product

# Evaluate needle finding capability
def evaluate_needle_finding(model, data: List[Dict]) -> Dict[str, float]:
    results = {
        "short_video_acc": 0.0,   # < 5 min
        "medium_video_acc": 0.0,  # 5 min up to (not including) 15 min
        "long_video_acc": 0.0,    # >= 15 min
        "overall_acc": 0.0
    }

    short_correct, short_total = 0, 0
    medium_correct, medium_total = 0, 0
    long_correct, long_total = 0, 0

    for item in data:
        duration = item['duration']
        video_path = f"videos/{item['video']}"

        # Predict answer; `model` is assumed to expose a user-defined find_needle() method
        prediction = model.find_needle(
            video_path,
            item['question'],
            item['candidates']
        )

        is_correct = (prediction == item['answer'])

        # Categorize by duration
        if duration < 300:
            short_total += 1
            short_correct += is_correct
        elif duration < 900:
            medium_total += 1
            medium_correct += is_correct
        else:
            long_total += 1
            long_correct += is_correct

    # Guard every ratio against empty buckets (and an empty dataset)
    results["short_video_acc"] = short_correct / short_total if short_total > 0 else 0.0
    results["medium_video_acc"] = medium_correct / medium_total if medium_total > 0 else 0.0
    results["long_video_acc"] = long_correct / long_total if long_total > 0 else 0.0
    results["overall_acc"] = (short_correct + medium_correct + long_correct) / len(data) if data else 0.0

    return results

# Statistics
durations = [item['duration'] for item in data]
print(f"Average duration: {sum(durations)/len(durations):.2f}s")
print(f"Max duration: {max(durations):.2f}s")
print(f"Min duration: {min(durations):.2f}s")

# Sample by duration, using video length as a rough proxy for difficulty (an assumption)
easy_samples = sorted(data, key=lambda x: x['duration'])[:100]
hard_samples = sorted(data, key=lambda x: x['duration'], reverse=True)[:100]
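
Before plugging in a real model, it can help to run the evaluation loop with a random-guess baseline. The sketch below assumes the same hypothetical `find_needle(video_path, question, candidates)` interface used in the evaluation function; `RandomBaseline` and `accuracy` are illustrative names, not part of MLVU or FlagEmbedding.

```python
import random
from typing import Dict, List


class RandomBaseline:
    """Hypothetical stand-in model that picks a candidate uniformly at random."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)  # private RNG keeps runs reproducible

    def find_needle(self, video_path: str, question: str, candidates: List[str]) -> str:
        # The video and question are ignored on purpose: this is a pure guess
        return self._rng.choice(candidates)


def accuracy(model, data: List[Dict]) -> float:
    """Fraction of entries where the model's choice equals the stored answer."""
    if not data:
        return 0.0
    correct = 0
    for item in data:
        prediction = model.find_needle(
            f"videos/{item['video']}",  # assumed video directory layout
            item["question"],
            item["candidates"],
        )
        correct += prediction == item["answer"]
    return correct / len(data)
```

With four candidates per question, a random guesser should score close to 25%, so a model that lands near that figure is probably not using the video at all.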
