Implementation: FlagOpen FlagEmbedding MLVU Ego Data
| Knowledge Sources | |
|---|---|
| Domains | Video Understanding, Egocentric Vision, Question Answering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Benchmark dataset for egocentric video understanding with first-person perspective question answering.
Description
The MLVU Ego dataset contains 4578 egocentric video questions that test models' ability to understand first-person perspective videos. These questions typically involve activities recorded from the camera wearer's viewpoint, requiring understanding of object interactions, spatial reasoning, and temporal sequences from an egocentric perspective. Questions often use first-person language ("What did I...") and focus on actions, object locations, and procedural understanding.
This dataset is particularly challenging as it requires:
- Understanding first-person perspective and camera motion
- Tracking object interactions and manipulations
- Spatial reasoning from egocentric viewpoint
- Temporal reasoning about action sequences
- Memory of past events in the video
Usage
Use this dataset for evaluating egocentric video understanding, benchmarking action recognition from first-person view, or training models on procedural task understanding.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/MLVU/data/3_ego.json
Data Structure
```python
{
    "video": str,             # Video filename
    "duration": float,        # Video duration in seconds
    "question": str,          # First-person question (often "What did I...")
    "candidates": List[str],  # Four candidate answers
    "answer": str,            # Correct answer
    "question_type": str      # Always "ego"
}
```
Import
```python
import json

# Load egocentric video dataset
with open("research/MLVU/data/3_ego.json", "r") as f:
    ego_data = [json.loads(line) for line in f]
```
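The snippet above assumes one JSON object per line (JSON Lines). If `3_ego.json` is instead a single JSON array, `json.loads(line)` will fail on the first line. A tolerant loader (a sketch; the file's exact serialization is an assumption here) handles both layouts:

```python
import json
from typing import Dict, List

def load_json_or_jsonl(file_path: str) -> List[Dict]:
    """Load either a single JSON document (e.g. a list of records)
    or line-delimited JSON objects from the same path."""
    with open(file_path, "r") as f:
        text = f.read()
    try:
        # Whole-file JSON: a top-level array of records
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one record per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Either way the result is a flat list of question records, so downstream code is unaffected by the on-disk layout.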
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the ego dataset JSON file |
Outputs
| Field | Type | Description |
|---|---|---|
| video | str | Egocentric video filename |
| duration | float | Video duration in seconds |
| question | str | First-person perspective question |
| candidates | List[str] | Four possible answers |
| answer | str | Correct answer |
| question_type | str | Type identifier ("ego") |
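The field types in the table above can be checked mechanically. A minimal validator sketch (the sample record below is illustrative, not taken from the dataset):

```python
from typing import Any, Dict

# Expected field types from the I/O contract table above
SCHEMA = {
    "video": str,
    "duration": float,
    "question": str,
    "candidates": list,
    "answer": str,
    "question_type": str,
}

def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record matches the documented schema."""
    if set(record) != set(SCHEMA):
        return False
    if not all(isinstance(record[k], t) for k, t in SCHEMA.items()):
        return False
    # Four candidate answers, one of which is the correct answer
    return len(record["candidates"]) == 4 and record["answer"] in record["candidates"]

# Illustrative record (not from the dataset)
sample = {
    "video": "ego_example.mp4",
    "duration": 120.0,
    "question": "What did I place on the table?",
    "candidates": ["a cup", "a book", "a phone", "a key"],
    "answer": "a cup",
    "question_type": "ego",
}
print(validate_record(sample))  # True
```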
Usage Examples
```python
import json
from typing import Dict, List

# Load ego dataset
def load_ego_data(file_path: str) -> List[Dict]:
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f]

data = load_ego_data("research/MLVU/data/3_ego.json")

# Example entry
example = data[0]
print(f"Video: {example['video']}")
print(f"Duration: {example['duration']:.2f}s")
print(f"Question: {example['question']}")
print(f"Candidates: {example['candidates']}")
print(f"Answer: {example['answer']}")
# Output:
# Video: ego_35.mp4
# Duration: 408.63s
# Question: What did I put in the orange trashcan
# Candidates: ['a lemon green sponge', 'a blue pen',
#              'a red apple', 'a yellow banana']
# Answer: a lemon green sponge
```
```python
# Analyze question patterns
def analyze_question_patterns(data: List[Dict]) -> Dict[str, int]:
    patterns = {
        "what_did_i": 0,
        "where_was": 0,
        "how_many": 0,
        "which": 0,
        "other": 0,
    }
    for item in data:
        question_lower = item['question'].lower()
        if "what did i" in question_lower:
            patterns["what_did_i"] += 1
        elif "where was" in question_lower:
            patterns["where_was"] += 1
        elif "how many" in question_lower:
            patterns["how_many"] += 1
        elif "which" in question_lower:
            patterns["which"] += 1
        else:
            patterns["other"] += 1
    return patterns

patterns = analyze_question_patterns(data)
print("Question patterns:", patterns)
```
```python
# Evaluate egocentric understanding
def evaluate_egocentric(model, data: List[Dict]) -> float:
    correct = 0
    for item in data:
        video_path = f"videos/{item['video']}"
        # Model predicts from egocentric perspective
        prediction = model.answer_egocentric(
            video_path,
            item['question'],
            item['candidates'],
        )
        if prediction == item['answer']:
            correct += 1
    accuracy = correct / len(data)
    return accuracy
```
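The evaluation loop above only assumes the model exposes an `answer_egocentric(video_path, question, candidates)` method; that interface is a placeholder for this document, not part of any real library. A minimal dummy model makes the harness runnable end to end:

```python
from typing import Dict, List

class DummyModel:
    """Placeholder model: always picks the first candidate answer."""
    def answer_egocentric(self, video_path: str, question: str,
                          candidates: List[str]) -> str:
        return candidates[0]

def evaluate_egocentric(model, data: List[Dict]) -> float:
    correct = 0
    for item in data:
        prediction = model.answer_egocentric(
            f"videos/{item['video']}", item['question'], item['candidates']
        )
        if prediction == item['answer']:
            correct += 1
    return correct / len(data)

# Two synthetic items: the dummy gets the first right, the second wrong
fake_data = [
    {"video": "a.mp4", "question": "What did I hold?",
     "candidates": ["a cup", "a pen"], "answer": "a cup"},
    {"video": "b.mp4", "question": "What did I open?",
     "candidates": ["a door", "a jar"], "answer": "a jar"},
]
print(evaluate_egocentric(DummyModel(), fake_data))  # 0.5
```

Swapping `DummyModel` for a real video-language model only requires implementing the same method signature.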
```python
# Filter by action type
object_interaction_questions = [
    item for item in data
    if "put" in item['question'].lower()
    or "pick" in item['question'].lower()
    or "place" in item['question'].lower()
]

location_questions = [
    item for item in data
    if "where" in item['question'].lower()
]

counting_questions = [
    item for item in data
    if "how many" in item['question'].lower()
]

print(f"Object interaction: {len(object_interaction_questions)}")
print(f"Location: {len(location_questions)}")
print(f"Counting: {len(counting_questions)}")
```
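Since each record also carries a `duration` field, questions can be bucketed by video length, which is useful because longer videos stress temporal memory harder. A sketch (the bucket boundaries are arbitrary choices, and the sample records are synthetic):

```python
from collections import Counter
from typing import Dict, List

def bucket_by_duration(data: List[Dict]) -> Counter:
    """Count questions per video-length bucket (boundaries are arbitrary)."""
    buckets = Counter()
    for item in data:
        if item["duration"] < 120:
            buckets["<2 min"] += 1
        elif item["duration"] < 600:
            buckets["2-10 min"] += 1
        else:
            buckets[">=10 min"] += 1
    return buckets

# Synthetic example records; one lands in each bucket
sample = [{"duration": 90.0}, {"duration": 408.63}, {"duration": 1500.0}]
print(bucket_by_duration(sample))
```

Reporting accuracy per bucket, rather than a single aggregate number, shows whether a model degrades on longer egocentric videos.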