
Implementation:FlagOpen FlagEmbedding MLVU Ego Data

From Leeroopedia


Knowledge Sources
Domains Video Understanding, Egocentric Vision, Question Answering
Last Updated 2026-02-09 00:00 GMT

Overview

Benchmark dataset for egocentric video understanding with first-person perspective question answering.

Description

The MLVU Ego dataset contains 4578 egocentric video questions that test models' ability to understand first-person perspective videos. These questions typically involve activities recorded from the camera wearer's viewpoint, requiring understanding of object interactions, spatial reasoning, and temporal sequences from an egocentric perspective. Questions often use first-person language ("What did I...") and focus on actions, object locations, and procedural understanding.

This dataset is particularly challenging as it requires:

  • Understanding first-person perspective and camera motion
  • Tracking object interactions and manipulations
  • Spatial reasoning from egocentric viewpoint
  • Temporal reasoning about action sequences
  • Memory of past events in the video

Usage

Use this dataset for evaluating egocentric video understanding, benchmarking action recognition from first-person view, or training models on procedural task understanding.

Code Reference

Source Location

Data Structure

{
    "video": str,              # Video filename
    "duration": float,         # Video duration in seconds
    "question": str,           # First-person question (often "What did I...")
    "candidates": List[str],   # Four candidate answers
    "answer": str,             # Correct answer
    "question_type": str       # Always "ego"
}
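A lightweight validator can confirm that a loaded record matches this schema. The sketch below is an assumption-driven check (field names and types follow the structure shown above; the example record reuses the sample entry from the Usage Examples section):

```python
# Expected field types, taken from the documented data structure
EXPECTED_TYPES = {
    "video": str, "duration": float, "question": str,
    "candidates": list, "answer": str, "question_type": str,
}

def validate_record(rec: dict) -> bool:
    """Return True if a record has all documented fields with the right types."""
    return (
        all(isinstance(rec.get(k), t) for k, t in EXPECTED_TYPES.items())
        and len(rec["candidates"]) == 4
        and rec["answer"] in rec["candidates"]
        and rec["question_type"] == "ego"
    )

rec = {
    "video": "ego_35.mp4", "duration": 408.63,
    "question": "What did I put in the orange trashcan",
    "candidates": ["a lemon green sponge", "a blue pen",
                   "a red apple", "a yellow banana"],
    "answer": "a lemon green sponge", "question_type": "ego",
}
print(validate_record(rec))  # True
```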

Import

import json

# Load egocentric video dataset
with open("research/MLVU/data/3_ego.json", "r") as f:
    ego_data = [json.loads(line) for line in f]
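The snippet above assumes the file is JSON Lines (one object per line). Since a `.json` extension can also hold a single JSON array, a tolerant loader that accepts either layout is a safer sketch (the layout assumption and the sample record here are illustrative, not taken from the real file):

```python
import json
from typing import Dict, List

def load_records(text: str) -> List[Dict]:
    """Parse dataset text that may be a JSON array or JSON Lines."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)  # single JSON array
    # otherwise treat as JSON Lines, skipping blank lines
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

# Demo with a synthetic record shaped like the documented schema
sample = ('{"video": "ego_0.mp4", "duration": 10.0, '
          '"question": "What did I hold?", '
          '"candidates": ["a", "b", "c", "d"], '
          '"answer": "a", "question_type": "ego"}')
records = load_records(sample)
print(len(records), records[0]["question_type"])  # 1 ego
```

In practice you would call `load_records(open(file_path).read())` and get the same list either way.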

I/O Contract

Inputs

Name       Type  Required  Description
file_path  str   Yes       Path to the ego dataset JSON file

Outputs

Field          Type       Description
video          str        Egocentric video filename
duration       float      Video duration in seconds
question       str        First-person perspective question
candidates     List[str]  Four possible answers
answer         str        Correct answer
question_type  str        Type identifier ("ego")

Usage Examples

import json
from typing import List, Dict

# Load ego dataset
def load_ego_data(file_path: str) -> List[Dict]:
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f]

data = load_ego_data("research/MLVU/data/3_ego.json")

# Example entry
example = data[0]
print(f"Video: {example['video']}")
print(f"Duration: {example['duration']:.2f}s")
print(f"Question: {example['question']}")
print(f"Candidates: {example['candidates']}")
print(f"Answer: {example['answer']}")

# Output:
# Video: ego_35.mp4
# Duration: 408.63s
# Question: What did I put in the orange trashcan
# Candidates: ['a lemon green sponge', 'a blue pen',
#              'a red apple', 'a yellow banana']
# Answer: a lemon green sponge

# Analyze question patterns
def analyze_question_patterns(data: List[Dict]) -> Dict:
    patterns = {
        "what_did_i": 0,
        "where_was": 0,
        "how_many": 0,
        "which": 0,
        "other": 0
    }

    for item in data:
        question_lower = item['question'].lower()
        if "what did i" in question_lower:
            patterns["what_did_i"] += 1
        elif "where was" in question_lower:
            patterns["where_was"] += 1
        elif "how many" in question_lower:
            patterns["how_many"] += 1
        elif "which" in question_lower:
            patterns["which"] += 1
        else:
            patterns["other"] += 1

    return patterns

patterns = analyze_question_patterns(data)
print("Question patterns:", patterns)

# Evaluate egocentric understanding
def evaluate_egocentric(model, data: List[Dict]) -> float:
    correct = 0

    for item in data:
        video_path = f"videos/{item['video']}"

        # Model predicts from egocentric perspective
        prediction = model.answer_egocentric(
            video_path,
            item['question'],
            item['candidates']
        )

        if prediction == item['answer']:
            correct += 1

    accuracy = correct / len(data)
    return accuracy
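With four candidates per question, uniform random guessing should land near 25% accuracy, which makes a useful floor when reading evaluation numbers. The stand-in model below is hypothetical (the dataset does not ship a model); it exists only to sanity-check the evaluation loop's shape:

```python
import random

class RandomBaseline:
    """Hypothetical stand-in model that guesses uniformly among candidates."""
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def answer_egocentric(self, video_path, question, candidates):
        return self.rng.choice(candidates)

# Synthetic items shaped like real dataset entries
items = [
    {"video": f"ego_{i}.mp4", "question": "What did I hold?",
     "candidates": ["a", "b", "c", "d"], "answer": "a"}
    for i in range(1000)
]

model = RandomBaseline()
correct = sum(
    model.answer_egocentric(f"videos/{it['video']}",
                            it["question"], it["candidates"]) == it["answer"]
    for it in items
)
accuracy = correct / len(items)
print(f"Random-baseline accuracy: {accuracy:.3f}")  # near 0.25 for 4 candidates
```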

# Filter by action type
object_interaction_questions = [
    item for item in data
    if "put" in item['question'].lower() or
       "pick" in item['question'].lower() or
       "place" in item['question'].lower()
]

location_questions = [
    item for item in data
    if "where" in item['question'].lower()
]

counting_questions = [
    item for item in data
    if "how many" in item['question'].lower()
]

print(f"Object interaction: {len(object_interaction_questions)}")
print(f"Location: {len(location_questions)}")
print(f"Counting: {len(counting_questions)}")
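Questions can also be stratified by video length, since longer videos stress temporal memory more. The bucket boundaries below are arbitrary choices for illustration, and the durations are synthetic example values (real values come from each item's duration field):

```python
from collections import Counter

def duration_bucket(seconds: float) -> str:
    """Bucket a video duration into coarse length bands (arbitrary cutoffs)."""
    if seconds < 120:
        return "short (<2 min)"
    if seconds < 480:
        return "medium (2-8 min)"
    return "long (>=8 min)"

durations = [45.0, 300.5, 408.63, 950.0]  # synthetic example values
counts = Counter(duration_bucket(d) for d in durations)
print(dict(counts))
```

On the real data you would feed `item['duration']` for every item and report accuracy per bucket.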
