Implementation: FlagOpen FlagEmbedding MLVU Ego Data
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Egocentric Vision, Question Answering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Benchmark dataset for egocentric video understanding with first-person perspective question answering.
Description
The MLVU Ego dataset contains 4578 egocentric video questions that test models' ability to understand first-person perspective videos. These questions typically involve activities recorded from the camera wearer's viewpoint, requiring understanding of object interactions, spatial reasoning, and temporal sequences from an egocentric perspective. Questions often use first-person language ("What did I...") and focus on actions, object locations, and procedural understanding.
This dataset is particularly challenging as it requires:
- Understanding first-person perspective and camera motion
- Tracking object interactions and manipulations
- Spatial reasoning from egocentric viewpoint
- Temporal reasoning about action sequences
- Memory of past events in the video
Usage
Use this dataset for evaluating egocentric video understanding, benchmarking action recognition from first-person view, or training models on procedural task understanding.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/data/3_ego.json
Data Structure
```python
{
    "video": str,             # Video filename
    "duration": float,        # Video duration in seconds
    "question": str,          # First-person question (often "What did I...")
    "candidates": List[str],  # Four candidate answers
    "answer": str,            # Correct answer
    "question_type": str      # Always "ego"
}
```
Import
```python
import json

# Load egocentric video dataset
with open("research/MLVU/data/3_ego.json", "r") as f:
    ego_data = [json.loads(line) for line in f]
```
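The snippet above assumes one JSON object per line (JSON Lines). If `3_ego.json` is instead a single JSON array, `json.loads(line)` will fail on the first line. A tolerant loader (a sketch; the file's exact serialization is an assumption here) handles both layouts:

```python
import json
from typing import Dict, List

def load_json_or_jsonl(file_path: str) -> List[Dict]:
    """Load either a single JSON document (e.g. a list of records)
    or line-delimited JSON objects from the same path."""
    with open(file_path, "r") as f:
        text = f.read()
    try:
        # Whole-file JSON: a top-level array of records
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one record per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Either way the result is a flat list of question records, so downstream code is unaffected by the on-disk layout.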
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the ego dataset JSON file |
Outputs
| Field | Type | Description |
|---|---|---|
| video | str | Egocentric video filename |
| duration | float | Video duration in seconds |
| question | str | First-person perspective question |
| candidates | List[str] | Four possible answers |
| answer | str | Correct answer |
| question_type | str | Type identifier ("ego") |
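The field types in the table above can be checked mechanically. A minimal validator sketch (the sample record below is illustrative, not taken from the dataset):

```python
from typing import Any, Dict

# Expected field types from the I/O contract table above
SCHEMA = {
    "video": str,
    "duration": float,
    "question": str,
    "candidates": list,
    "answer": str,
    "question_type": str,
}

def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record matches the documented schema."""
    if set(record) != set(SCHEMA):
        return False
    if not all(isinstance(record[k], t) for k, t in SCHEMA.items()):
        return False
    # Four candidate answers, one of which is the correct answer
    return len(record["candidates"]) == 4 and record["answer"] in record["candidates"]

# Illustrative record (not from the dataset)
sample = {
    "video": "ego_example.mp4",
    "duration": 120.0,
    "question": "What did I place on the table?",
    "candidates": ["a cup", "a book", "a phone", "a key"],
    "answer": "a cup",
    "question_type": "ego",
}
print(validate_record(sample))  # True
```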
Usage Examples
```python
import json
from typing import Dict, List

# Load ego dataset
def load_ego_data(file_path: str) -> List[Dict]:
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f]

data = load_ego_data("research/MLVU/data/3_ego.json")

# Example entry
example = data[0]
print(f"Video: {example['video']}")
print(f"Duration: {example['duration']:.2f}s")
print(f"Question: {example['question']}")
print(f"Candidates: {example['candidates']}")
print(f"Answer: {example['answer']}")
# Output:
# Video: ego_35.mp4
# Duration: 408.63s
# Question: What did I put in the orange trashcan
# Candidates: ['a lemon green sponge', 'a blue pen',
#              'a red apple', 'a yellow banana']
# Answer: a lemon green sponge
```
```python
# Analyze question patterns
def analyze_question_patterns(data: List[Dict]) -> Dict[str, int]:
    patterns = {
        "what_did_i": 0,
        "where_was": 0,
        "how_many": 0,
        "which": 0,
        "other": 0,
    }
    for item in data:
        question_lower = item['question'].lower()
        if "what did i" in question_lower:
            patterns["what_did_i"] += 1
        elif "where was" in question_lower:
            patterns["where_was"] += 1
        elif "how many" in question_lower:
            patterns["how_many"] += 1
        elif "which" in question_lower:
            patterns["which"] += 1
        else:
            patterns["other"] += 1
    return patterns

patterns = analyze_question_patterns(data)
print("Question patterns:", patterns)
```
```python
# Evaluate egocentric understanding
def evaluate_egocentric(model, data: List[Dict]) -> float:
    correct = 0
    for item in data:
        video_path = f"videos/{item['video']}"
        # Model predicts from egocentric perspective
        prediction = model.answer_egocentric(
            video_path,
            item['question'],
            item['candidates'],
        )
        if prediction == item['answer']:
            correct += 1
    accuracy = correct / len(data)
    return accuracy
```
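The evaluation loop above only assumes the model exposes an `answer_egocentric(video_path, question, candidates)` method; that interface is a placeholder for this document, not part of any real library. A minimal dummy model makes the harness runnable end to end:

```python
from typing import Dict, List

class DummyModel:
    """Placeholder model: always picks the first candidate answer."""
    def answer_egocentric(self, video_path: str, question: str,
                          candidates: List[str]) -> str:
        return candidates[0]

def evaluate_egocentric(model, data: List[Dict]) -> float:
    correct = 0
    for item in data:
        prediction = model.answer_egocentric(
            f"videos/{item['video']}", item['question'], item['candidates']
        )
        if prediction == item['answer']:
            correct += 1
    return correct / len(data)

# Two synthetic items: the dummy gets the first right, the second wrong
fake_data = [
    {"video": "a.mp4", "question": "What did I hold?",
     "candidates": ["a cup", "a pen"], "answer": "a cup"},
    {"video": "b.mp4", "question": "What did I open?",
     "candidates": ["a door", "a jar"], "answer": "a jar"},
]
print(evaluate_egocentric(DummyModel(), fake_data))  # 0.5
```

Swapping `DummyModel` for a real video-language model only requires implementing the same method signature.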
```python
# Filter by action type
object_interaction_questions = [
    item for item in data
    if "put" in item['question'].lower()
    or "pick" in item['question'].lower()
    or "place" in item['question'].lower()
]

location_questions = [
    item for item in data
    if "where" in item['question'].lower()
]

counting_questions = [
    item for item in data
    if "how many" in item['question'].lower()
]

print(f"Object interaction: {len(object_interaction_questions)}")
print(f"Location: {len(location_questions)}")
print(f"Counting: {len(counting_questions)}")
```
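Since each record also carries a `duration` field, questions can be bucketed by video length, which is useful because longer videos stress temporal memory harder. A sketch (the bucket boundaries are arbitrary choices, and the sample records are synthetic):

```python
from collections import Counter
from typing import Dict, List

def bucket_by_duration(data: List[Dict]) -> Counter:
    """Count questions per video-length bucket (boundaries are arbitrary)."""
    buckets = Counter()
    for item in data:
        if item["duration"] < 120:
            buckets["<2 min"] += 1
        elif item["duration"] < 600:
            buckets["2-10 min"] += 1
        else:
            buckets[">=10 min"] += 1
    return buckets

# Synthetic example records; one lands in each bucket
sample = [{"duration": 90.0}, {"duration": 408.63}, {"duration": 1500.0}]
print(bucket_by_duration(sample))
```

Reporting accuracy per bucket, rather than a single aggregate number, shows whether a model degrades on longer egocentric videos.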