Workflow:Facebookresearch Habitat lab Agent Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | Embodied_AI, Evaluation, Benchmarking |
| Last Updated | 2026-02-15 02:00 GMT |
Overview
End-to-end process for evaluating embodied agent performance on standard navigation and interaction tasks using the Habitat Benchmark framework with reproducible metrics.
Description
This workflow covers evaluating agents on Habitat tasks using the standardized Benchmark class, which provides a consistent evaluation protocol across different agent implementations. The process includes defining or loading an agent, selecting an evaluation dataset, running the agent through episodes, and collecting standard metrics: Success weighted by Path Length (SPL), Success Rate, and Distance to Goal for navigation, and task completion for rearrangement. It supports both simple hand-coded agents and trained neural network policies, and follows the Habitat Challenge evaluation protocol for reproducible comparison.
Usage
Execute this workflow when you need to measure agent performance on a standard Habitat task, compare multiple agent implementations, prepare a Habitat Challenge submission, or establish baseline performance numbers for a new task or dataset.
Execution Steps
Step 1: Agent Implementation
Define or load the agent to be evaluated. Agents implement the `habitat.Agent` interface with `reset()` and `act(observations)` methods. For trained policies, wrap the model checkpoint in a PPOAgent that loads weights and performs inference. For baselines, implement simple hand-coded strategies (forward-only, random, goal-follower).
Key considerations:
- All agents must implement the `habitat.Agent` abstract interface
- PPOAgent wraps trained RL policies for evaluation via the Benchmark class
- Simple agents (ForwardOnlyAgent, GoalFollower) serve as baselines
- ShortestPathFollower provides an oracle upper bound for navigation tasks
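The agent contract described in this step can be sketched as follows. To keep the example runnable without Habitat installed, `AgentInterface` is a hypothetical stand-in for `habitat.Agent` that mirrors its two methods. The action dictionary format and the `"move_forward"` and `"stop"` action names are assumptions; the valid actions depend on the action space of the configured task.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class AgentInterface(ABC):
    """Hypothetical stand-in for habitat.Agent (same two methods)."""

    @abstractmethod
    def reset(self) -> None:
        """Clear any per-episode state."""

    @abstractmethod
    def act(self, observations: Dict[str, Any]) -> Dict[str, Any]:
        """Return the next action given the latest observations."""


class ForwardThenStopAgent(AgentInterface):
    """Toy baseline: move forward a fixed number of steps, then stop."""

    def __init__(self, max_forward_steps: int = 3) -> None:
        self.max_forward_steps = max_forward_steps
        self._steps_taken = 0

    def reset(self) -> None:
        # Called at the start of every episode.
        self._steps_taken = 0

    def act(self, observations: Dict[str, Any]) -> Dict[str, Any]:
        # This hand-coded baseline ignores its observations.
        if self._steps_taken < self.max_forward_steps:
            self._steps_taken += 1
            return {"action": "move_forward"}  # assumed action name
        return {"action": "stop"}  # assumed action name
```

In a real evaluation the class would subclass `habitat.Agent` directly, so that the Benchmark class can call `reset()` and `act()` on it.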
Step 2: Task and Dataset Selection
Select the evaluation task configuration and episode dataset. Task configs define the observation space, action space, and success criteria. Episode datasets provide standardized evaluation splits with fixed start positions and goals for reproducible comparison.
Key considerations:
- Use benchmark configs under `habitat-lab/habitat/config/benchmark/` for standardized evaluation
- PointNav, ObjectNav, ImageNav, and VLN each have task-specific configs
- Evaluation datasets are kept separate from training datasets so that reported scores reflect generalization rather than memorized training episodes
- The Habitat Challenge uses specific dataset versions for fair comparison
Step 3: Benchmark Configuration
Create a `Benchmark` instance with the selected task configuration. Configure evaluation parameters including the number of episodes (the `num_episodes` argument of `evaluate()`), video recording options, and any sensor overrides. The Benchmark class manages environment creation, episode iteration, and metric aggregation.
Key considerations:
- The Benchmark class wraps environment creation and episode management
- Video recording can be enabled for qualitative analysis
- Episode count can be limited for quick validation runs
- Configuration overrides allow testing different sensor setups without changing configs
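A minimal sketch of this step, assuming Habitat and the required episode datasets are installed. The calls `habitat.Benchmark(config_path)` and `evaluate(agent, num_episodes=...)` follow the usage in habitat-lab's benchmark example script and should be checked against the installed version. `run_benchmark` and `format_metrics` are hypothetical helper names, and the config path is supplied by the caller rather than assumed here.

```python
from typing import Dict, Optional


def run_benchmark(
    agent, config_path: str, num_episodes: Optional[int] = None
) -> Dict[str, float]:
    """Evaluate `agent` on the task described by `config_path`.

    Passing num_episodes=None leaves the episode count to the Benchmark class.
    """
    # Imported lazily so this module loads even where Habitat is not installed.
    import habitat

    benchmark = habitat.Benchmark(config_path)
    return benchmark.evaluate(agent, num_episodes=num_episodes)


def format_metrics(metrics: Dict[str, float]) -> str:
    """Render averaged metrics one per line, sorted by metric name."""
    return "\n".join(
        f"{name}: {value:.3f}" for name, value in sorted(metrics.items())
    )
```

For a quick validation run, a small `num_episodes` value keeps the turnaround short before committing to a full evaluation split.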
Step 4: Evaluation Execution
Run the agent through the evaluation episodes. For each episode, the agent receives observations and returns actions until the episode terminates (success, failure, or max steps). The environment collects measurements at each step and computes final episode metrics.
Key considerations:
- Episodes end on an explicit stop action, on reaching the maximum step count, or, where the task is configured to end on success, when the success condition is met
- Navigation metrics include SPL, SoftSPL, Success Rate, and Distance to Goal
- Rearrangement metrics include task completion percentage and per-object placement accuracy
- Video frames are collected during evaluation for later visualization
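The episode loop described in this step can be illustrated with a toy example. `ToyCorridorEnv` is a hypothetical one-dimensional stand-in for a Habitat environment, not a real API; only the shape of the loop (reset, act, step until the episode is over, then read metrics) is meant to carry over. The action names and metric keys are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict


class ToyCorridorEnv:
    """Hypothetical 1-D corridor; the goal lies `goal_distance` steps ahead."""

    def __init__(self, goal_distance: int = 3, max_steps: int = 10) -> None:
        self.goal_distance = goal_distance
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> Dict[str, int]:
        self.position = 0
        self.steps = 0
        self.stopped = False
        return {"distance_to_goal": self.goal_distance}

    @property
    def episode_over(self) -> bool:
        # Ends on an explicit stop or when the step budget is exhausted.
        return self.stopped or self.steps >= self.max_steps

    def step(self, action: Dict[str, str]) -> Dict[str, int]:
        self.steps += 1
        if action["action"] == "stop":
            self.stopped = True
        elif action["action"] == "move_forward":
            self.position += 1
        return {"distance_to_goal": self.goal_distance - self.position}

    def get_metrics(self) -> Dict[str, float]:
        distance = abs(self.goal_distance - self.position)
        # Success requires stopping exactly at the goal.
        success = float(self.stopped and distance == 0)
        return {"success": success, "distance_to_goal": float(distance)}


class StopAtGoalAgent:
    """Toy agent: walk forward until the goal is reached, then stop."""

    def reset(self) -> None:
        pass

    def act(self, observations: Dict[str, int]) -> Dict[str, str]:
        if observations["distance_to_goal"] == 0:
            return {"action": "stop"}
        return {"action": "move_forward"}


def evaluate(agent, env, num_episodes: int) -> Dict[str, float]:
    """Run `num_episodes` episodes and return the mean of each metric."""
    totals = defaultdict(float)
    for _ in range(num_episodes):
        observations = env.reset()
        agent.reset()
        while not env.episode_over:
            action = agent.act(observations)
            observations = env.step(action)
        for name, value in env.get_metrics().items():
            totals[name] += value
    return {name: total / num_episodes for name, total in totals.items()}
```

Shrinking `max_steps` below `goal_distance` shows the step-limit termination path: the episode ends without a stop action and is scored as a failure.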
Step 5: Metric Aggregation and Reporting
Aggregate per-episode metrics into summary statistics. Report mean values across all evaluation episodes for each metric. Generate evaluation videos showing agent trajectories with top-down map overlays for navigation tasks.
Key considerations:
- Metrics are averaged across all evaluation episodes
- Per-episode breakdowns help identify failure modes
- Top-down map visualization shows agent path versus optimal path
- Results should be compared against published baselines for context
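As a sketch of the aggregation step, the following computes per-episode SPL using the standard definition (success multiplied by the ratio of shortest path length to the larger of the agent path length and the shortest path length), averages each metric across episodes, and lists the failed episodes for closer inspection. The function names and the metric keys `"success"` and `"spl"` are illustrative assumptions, not Habitat's own code.

```python
from statistics import mean
from typing import Dict, List


def episode_spl(
    success: bool, shortest_path_length: float, agent_path_length: float
) -> float:
    """SPL for one episode: success * l / max(p, l)."""
    if not success:
        return 0.0
    longest = max(agent_path_length, shortest_path_length)
    if longest == 0:
        # Assumption: an episode that starts on the goal counts as optimal.
        return 1.0
    return shortest_path_length / longest


def summarize(episodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean of every metric over all episodes (same keys assumed in each)."""
    return {name: mean(ep[name] for ep in episodes) for name in episodes[0]}


def failed_episodes(episodes: List[Dict[str, float]]) -> List[int]:
    """Indices of episodes with success == 0, for per-episode inspection."""
    return [index for index, ep in enumerate(episodes) if ep["success"] == 0]
```

For example, one successful episode whose path is twice the shortest path (SPL 0.5) and one failed episode (SPL 0) average to a success rate of 0.5 and an SPL of 0.25.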