Workflow:Facebookresearch Habitat lab Agent Benchmarking

Knowledge Sources	Habitat-Lab Habitat Docs Habitat 1.0
Domains	Embodied_AI, Evaluation, Benchmarking
Last Updated	2026-02-15 02:00 GMT

Overview

End-to-end process for evaluating embodied agent performance on standard navigation and interaction tasks using the Habitat Benchmark framework with reproducible metrics.

Description

This workflow covers evaluating agents on Habitat tasks using the standardized Benchmark class, which provides a consistent evaluation protocol across different agent implementations. The process includes defining or loading an agent, selecting an evaluation dataset, running the agent through episodes, and collecting standard metrics (SPL, Success Rate, Distance to Goal for navigation; task completion for rearrangement). It supports both simple hand-coded agents and trained neural network policies, and follows the Habitat Challenge evaluation protocol for reproducible comparison.

Usage

Execute this workflow when you need to measure agent performance on a standard Habitat task, compare multiple agent implementations, prepare a Habitat Challenge submission, or establish baseline performance numbers for a new task or dataset.

Execution Steps

Step 1: Agent Implementation

Define or load the agent to be evaluated. Agents implement the `habitat.Agent` interface: `reset()` is called at the start of each episode and `act(observations)` returns the next action. For trained policies, wrap the model checkpoint in a `PPOAgent`, which loads the weights and performs inference. For baselines, implement simple hand-coded strategies (forward-only, random, goal-follower).

Key considerations:

All agents must implement the `habitat.Agent` abstract interface
PPOAgent wraps trained RL policies for evaluation via the Benchmark class
Simple agents (ForwardOnlyAgent, GoalFollower) serve as baselines
ShortestPathFollower provides an oracle upper bound for navigation tasks
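
The interface described above can be illustrated with a minimal hand-coded agent. This is only a sketch, not the library's `GoalFollower`: the class name `GoalFollowerSketch`, the thresholds, the observation key `pointgoal_with_gps_compass`, the action names, and the sign convention for the goal angle are all assumptions and should be checked against the task configuration in use.

```python
import math
from typing import Any, Dict

try:
    from habitat import Agent as AgentBase  # real base class when Habitat-Lab is installed
except ImportError:
    class AgentBase:  # stand-in so the sketch also runs without Habitat-Lab
        pass


class GoalFollowerSketch(AgentBase):
    """Hand-coded baseline: turn toward the goal, move forward, stop when close."""

    def __init__(
        self,
        success_distance: float = 0.2,               # assumed stop radius in meters
        angle_threshold: float = math.radians(15),   # assumed heading tolerance
        goal_key: str = "pointgoal_with_gps_compass",  # assumed sensor name
    ) -> None:
        self.success_distance = success_distance
        self.angle_threshold = angle_threshold
        self.goal_key = goal_key
        self.steps_taken = 0

    def reset(self) -> None:
        # Clear per-episode state; called once before each episode.
        self.steps_taken = 0

    def act(self, observations: Dict[str, Any]) -> Dict[str, str]:
        # Assumes the goal sensor yields (distance, angle) in polar form.
        distance, angle = (float(x) for x in observations[self.goal_key][:2])
        self.steps_taken += 1
        if distance <= self.success_distance:
            return {"action": "stop"}
        if abs(angle) < self.angle_threshold:
            return {"action": "move_forward"}
        # Assumed convention: a positive angle means the goal lies to the left.
        return {"action": "turn_left" if angle > 0 else "turn_right"}
```

Keeping all per-episode state in `reset()` matters because the same agent instance is reused across every evaluation episode.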

Step 2: Task and Dataset Selection

Select the evaluation task configuration and episode dataset. Task configs define the observation space, action space, and success criteria. Episode datasets provide standardized evaluation splits with fixed start positions and goals for reproducible comparison.

Key considerations:

Use benchmark configs under `habitat-lab/habitat/config/benchmark/` for standardized evaluation
PointNav, ObjectNav, ImageNav, and VLN each have task-specific configs
Evaluation splits are kept separate from training splits so that reported metrics reflect generalization rather than memorized episodes
The Habitat Challenge uses specific dataset versions for fair comparison
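
Selecting a task config and an evaluation split might look like the sketch below. It assumes a recent Habitat-Lab release in which `habitat.get_config` accepts a list of Hydra-style overrides; the helper names `build_overrides` and `load_eval_config` are hypothetical, and the override keys should be verified against the installed version.

```python
from typing import List, Optional


def build_overrides(split: str = "val",
                    max_episode_steps: Optional[int] = None) -> List[str]:
    """Build Hydra-style override strings for an evaluation run.

    The key names below are assumptions based on recent Habitat-Lab configs.
    """
    overrides = [f"habitat.dataset.split={split}"]
    if max_episode_steps is not None:
        overrides.append(
            f"habitat.environment.max_episode_steps={max_episode_steps}"
        )
    return overrides


def load_eval_config(config_path: str, split: str = "val",
                     max_episode_steps: Optional[int] = None):
    """Load a benchmark config with the evaluation overrides applied."""
    # Imported lazily so build_overrides can be used without Habitat-Lab installed.
    import habitat

    return habitat.get_config(
        config_path, overrides=build_overrides(split, max_episode_steps)
    )
```

A call such as `load_eval_config("benchmark/nav/pointnav/pointnav_habitat_test.yaml")` would then return the config for a PointNav test setup, assuming that file exists in the installed release.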

Step 3: Benchmark Configuration

Create a Benchmark instance with the selected task configuration. Configure evaluation parameters including the number of episodes, video recording options, and any sensor overrides. The Benchmark class manages environment creation, episode iteration, and metric aggregation.

Key considerations:

The Benchmark class wraps environment creation and episode management
Video recording can be enabled for qualitative analysis
Episode count can be limited for quick validation runs
Configuration overrides allow testing different sensor setups without changing configs
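
A hedged sketch of this step follows. It assumes `habitat.Benchmark` is constructed from a config path and that `evaluate(agent, num_episodes=...)` returns a dictionary of averaged metrics; `run_benchmark` and `FakeBenchmark` are hypothetical names, with the latter included only so the wiring can be dry-run without a simulator.

```python
from typing import Any, Dict, Optional


def run_benchmark(agent: Any, config_path: str,
                  num_episodes: Optional[int] = None,
                  benchmark_cls: Optional[type] = None) -> Dict[str, float]:
    """Create a benchmark for one task config and evaluate a single agent."""
    if benchmark_cls is None:
        # Imported lazily so the function can be dry-run without Habitat-Lab.
        import habitat

        benchmark_cls = habitat.Benchmark
    benchmark = benchmark_cls(config_path)
    # num_episodes=None is assumed to mean "use every episode in the dataset".
    return benchmark.evaluate(agent, num_episodes=num_episodes)


class FakeBenchmark:
    """Hypothetical stand-in that mimics the assumed Benchmark call shape."""

    def __init__(self, config_paths: str) -> None:
        self.config_paths = config_paths

    def evaluate(self, agent: Any,
                 num_episodes: Optional[int] = None) -> Dict[str, float]:
        # Report only how many episodes were requested; no simulation happens.
        return {"num_episodes": float(num_episodes or 0)}
```

Passing a small `num_episodes` is a convenient way to validate the setup before committing to a full evaluation split.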

Step 4: Evaluation Execution

Run the agent through the evaluation episodes. For each episode, the agent receives observations and returns actions until the episode ends, typically because the agent issues a stop action or the step limit is reached. The environment updates its measurements at each step and reports the final episode metrics once the episode is over.

Key considerations:

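Episodes end on an explicit stop action or when the maximum step count is reached; some tasks may also end on a task-specific success condition. In navigation tasks, success is judged when the agent stops, so an agent that never stops near the goal is not credited with success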
Navigation metrics include SPL, SoftSPL, Success Rate, and Distance to Goal
Rearrangement metrics include task completion percentage and per-object placement accuracy
Video frames are collected during evaluation for later visualization
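
The per-episode loop can be sketched as below. The function relies only on `reset()`, `step()`, `episode_over`, and `get_metrics()` on the environment; `CorridorEnv` and the two agents are hypothetical toy stand-ins, not Habitat classes, and the sketch assumes every metric is a scalar.

```python
from collections import defaultdict
from typing import Any, Dict


def evaluate_locally(env: Any, agent: Any, num_episodes: int) -> Dict[str, float]:
    """Run the agent for num_episodes and average the scalar episode metrics."""
    totals: Dict[str, float] = defaultdict(float)
    for _ in range(num_episodes):
        observations = env.reset()
        agent.reset()
        while not env.episode_over:
            observations = env.step(agent.act(observations))
        for name, value in env.get_metrics().items():
            totals[name] += value
    return {name: total / num_episodes for name, total in totals.items()}


class CorridorEnv:
    """Toy 1-D environment: the goal is start_distance steps straight ahead."""

    def __init__(self, start_distance: int = 3, max_steps: int = 10) -> None:
        self.start_distance = start_distance
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> Dict[str, int]:
        self.distance = self.start_distance
        self.steps = 0
        self.stopped = False
        return {"distance": self.distance}

    @property
    def episode_over(self) -> bool:
        return self.stopped or self.steps >= self.max_steps

    def step(self, action: str) -> Dict[str, int]:
        self.steps += 1
        if action == "stop":
            self.stopped = True
        elif action == "move_forward":
            self.distance -= 1
        return {"distance": self.distance}

    def get_metrics(self) -> Dict[str, float]:
        # Success requires an explicit stop exactly at the goal.
        success = float(self.stopped and self.distance == 0)
        return {"success": success, "distance_to_goal": float(abs(self.distance))}


class StopAtGoalAgent:
    def reset(self) -> None:
        pass

    def act(self, observations: Dict[str, int]) -> str:
        return "stop" if observations["distance"] == 0 else "move_forward"


class ForwardOnlyAgent:
    def reset(self) -> None:
        pass

    def act(self, observations: Dict[str, int]) -> str:
        return "move_forward"
```

In this toy setup the forward-only agent walks past the goal and runs out of steps, which mirrors why a baseline that never issues a stop action scores zero success.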

Step 5: Metric Aggregation and Reporting

Aggregate per-episode metrics into summary statistics. Report mean values across all evaluation episodes for each metric. Generate evaluation videos showing agent trajectories with top-down map overlays for navigation tasks.

Key considerations:

Metrics are averaged across all evaluation episodes
Per-episode breakdowns help identify failure modes
Top-down map visualization shows agent path versus optimal path
Results should be compared against published baselines for context
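
As an illustration of the aggregation step, the sketch below computes SPL for a single episode using the standard definition (success weighted by the ratio of shortest path length to the longer of the agent's path and the shortest path), averages per-episode metrics, and lists failed episodes. The helper names and the dictionary layout are assumptions; in practice Habitat computes these measures itself.

```python
from statistics import mean
from typing import Dict, List


def spl(success: bool, shortest_path_length: float,
        agent_path_length: float) -> float:
    """Success weighted by Path Length for one episode.

    Assumes shortest_path_length > 0.
    """
    if not success:
        return 0.0
    return shortest_path_length / max(agent_path_length, shortest_path_length)


def summarize(per_episode: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all episodes (all episodes share the same keys)."""
    names = per_episode[0].keys()
    return {name: mean(ep[name] for ep in per_episode) for name in names}


def failed_episodes(per_episode: List[Dict[str, float]]) -> List[int]:
    """Indices of unsuccessful episodes, useful for inspecting failure modes."""
    return [i for i, ep in enumerate(per_episode) if ep["success"] < 1.0]
```

For example, an agent that succeeds in one of two episodes with a path twice the optimal length gets a mean success of 0.5 and a mean SPL of 0.25.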
