Workflow:Facebookresearch Habitat lab Agent Benchmarking

From Leeroopedia
Knowledge Sources
Domains: Embodied_AI, Evaluation, Benchmarking
Last Updated: 2026-02-15 02:00 GMT

Overview

End-to-end process for evaluating embodied agent performance on standard navigation and interaction tasks using the Habitat Benchmark framework with reproducible metrics.

Description

This workflow covers evaluating agents on Habitat tasks using the standardized `Benchmark` class, which provides a consistent evaluation protocol across different agent implementations. The process consists of defining or loading an agent, selecting an evaluation dataset, running the agent through episodes, and collecting standard metrics: SPL (Success weighted by Path Length), Success Rate, and Distance to Goal for navigation, and task completion for rearrangement. The workflow supports both simple hand-coded agents and trained neural network policies, and it follows the Habitat Challenge evaluation protocol so that results are reproducible and comparable.

Usage

Execute this workflow when you need to measure agent performance on a standard Habitat task, compare multiple agent implementations, prepare a Habitat Challenge submission, or establish baseline performance numbers for a new task or dataset.

Execution Steps

Step 1: Agent Implementation

Define or load the agent to be evaluated. Agents implement the `habitat.Agent` interface, which consists of the `reset()` and `act(observations)` methods. For trained policies, wrap the model checkpoint in a `PPOAgent`, which loads the weights and performs inference. For baselines, implement simple hand-coded strategies (forward-only, random, goal-follower).

Key considerations:

  • All agents must implement the `habitat.Agent` abstract interface
  • `PPOAgent` wraps trained RL policies for evaluation via the `Benchmark` class
  • Simple agents (`ForwardOnlyAgent`, `GoalFollower`) serve as baselines
  • `ShortestPathFollower` follows ground-truth shortest paths and therefore provides an oracle upper bound for navigation tasks
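The agent contract described above can be sketched without importing Habitat. In this minimal example, `Agent` is a stand-in that only mirrors the `reset()` / `act(observations)` shape of `habitat.Agent`; the class name `ForwardOnlyBaseline`, the action strings, and the `"action"` dictionary key are assumptions made for illustration, not the library's definitions.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Agent(ABC):
    """Stand-in for the habitat.Agent interface (not the real class)."""

    @abstractmethod
    def reset(self) -> None:
        """Clear any per-episode state before a new episode starts."""

    @abstractmethod
    def act(self, observations: Dict[str, Any]) -> Dict[str, Any]:
        """Map the current observations to an action."""


class ForwardOnlyBaseline(Agent):
    """Hypothetical baseline: move forward until a step budget is spent, then stop."""

    def __init__(self, max_forward_steps: int = 3) -> None:
        self.max_forward_steps = max_forward_steps
        self.steps_taken = 0

    def reset(self) -> None:
        # Per-episode counters must not leak into the next episode.
        self.steps_taken = 0

    def act(self, observations: Dict[str, Any]) -> Dict[str, Any]:
        if self.steps_taken >= self.max_forward_steps:
            return {"action": "stop"}
        self.steps_taken += 1
        return {"action": "move_forward"}
```

Keeping all per-episode state inside `reset()` matters because an evaluation driver typically reuses one agent object across every episode.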

Step 2: Task and Dataset Selection

Select the evaluation task configuration and episode dataset. Task configs define the observation space, action space, and success criteria. Episode datasets provide standardized evaluation splits with fixed start positions and goals for reproducible comparison.

Key considerations:

  • Use benchmark configs under `habitat-lab/habitat/config/benchmark/` for standardized evaluation
  • PointNav, ObjectNav, ImageNav, and VLN each have task-specific configs
  • Evaluation splits are held out from the training data, so reported metrics reflect generalization rather than memorization of training episodes
  • The Habitat Challenge uses specific dataset versions for fair comparison
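To illustrate why a fixed episode dataset makes runs comparable, the sketch below parses a small split into immutable records and returns them in a stable order. The field names, the inline JSON, and the helper `load_episodes` are simplified assumptions and do not reproduce Habitat's actual dataset schema or loaders.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Episode:
    """Simplified, hypothetical episode record with a fixed start and goal."""

    episode_id: str
    scene_id: str
    start_position: Tuple[float, ...]
    goal_position: Tuple[float, ...]


def load_episodes(raw_json: str) -> List[Episode]:
    """Parse a split and return its episodes sorted by episode_id."""
    records = json.loads(raw_json)["episodes"]
    episodes = [
        Episode(
            episode_id=str(record["episode_id"]),
            scene_id=record["scene_id"],
            start_position=tuple(record["start_position"]),
            goal_position=tuple(record["goal_position"]),
        )
        for record in records
    ]
    # A deterministic order means every agent sees the same episode sequence.
    return sorted(episodes, key=lambda episode: episode.episode_id)


# Illustrative data only; scene names and coordinates are made up.
VAL_SPLIT = """{"episodes": [
  {"episode_id": "1", "scene_id": "scene_b",
   "start_position": [0.0, 0.0, 0.0], "goal_position": [2.0, 0.0, 1.0]},
  {"episode_id": "0", "scene_id": "scene_a",
   "start_position": [1.0, 0.0, 1.0], "goal_position": [4.0, 0.0, 3.0]}
]}"""

episodes = load_episodes(VAL_SPLIT)
```

Freezing the dataclass guards against an agent or wrapper accidentally mutating a start or goal position between runs.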

Step 3: Benchmark Configuration

Create a Benchmark instance with the selected task configuration. Configure evaluation parameters including the number of episodes, video recording options, and any sensor overrides. The Benchmark class manages environment creation, episode iteration, and metric aggregation.

Key considerations:

  • The Benchmark class wraps environment creation and episode management
  • Video recording can be enabled for qualitative analysis
  • Episode count can be limited for quick validation runs
  • Configuration overrides allow testing different sensor setups without changing configs
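The responsibilities listed above (environment handling, episode iteration, an optional episode limit, and metric averaging) can be sketched as a small driver loop. This is a stdlib-only illustration: `ToyCorridorEnv`, `ToyGoalFollower`, `evaluate`, and the metric names are hypothetical stand-ins and should not be read as the Habitat `Benchmark` API.

```python
from typing import Dict, List, Optional


class ToyCorridorEnv:
    """Hypothetical 1-D corridor: start at 0 and issue "stop" exactly at the goal."""

    def __init__(self, goals: List[int], max_steps: int = 10) -> None:
        self.goals = goals
        self.max_steps = max_steps
        self._index = -1
        self.episode_over = True

    def reset(self) -> Dict[str, int]:
        # Advance to the next episode in a fixed order.
        self._index += 1
        self.position = 0
        self.steps = 0
        self.stopped = False
        self.episode_over = False
        return self._observe()

    def step(self, action: str) -> Dict[str, int]:
        self.steps += 1
        if action == "stop":
            self.stopped = True
            self.episode_over = True
        else:
            self.position += 1
            self.episode_over = self.steps >= self.max_steps
        return self._observe()

    def get_metrics(self) -> Dict[str, float]:
        goal = self.goals[self._index]
        success = float(self.stopped and self.position == goal)
        return {"success": success, "distance_to_goal": float(abs(goal - self.position))}

    def _observe(self) -> Dict[str, int]:
        return {"distance": self.goals[self._index] - self.position}


class ToyGoalFollower:
    """Hypothetical agent: walk forward until the observed distance is zero."""

    def reset(self) -> None:
        pass

    def act(self, observations: Dict[str, int]) -> str:
        return "stop" if observations["distance"] == 0 else "move_forward"


def evaluate(env: ToyCorridorEnv, agent: ToyGoalFollower,
             num_episodes: Optional[int] = None) -> Dict[str, float]:
    """Run the agent over episodes and return the mean of each metric."""
    total = len(env.goals) if num_episodes is None else num_episodes
    if not 0 < total <= len(env.goals):
        raise ValueError("num_episodes must be between 1 and the dataset size")
    sums: Dict[str, float] = {}
    for _ in range(total):
        agent.reset()
        observations = env.reset()
        while not env.episode_over:
            observations = env.step(agent.act(observations))
        for name, value in env.get_metrics().items():
            sums[name] = sums.get(name, 0.0) + value
    return {name: value / total for name, value in sums.items()}
```

Passing a small `num_episodes` gives a quick validation run, while leaving it as `None` evaluates the whole split.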

Step 4: Evaluation Execution

Run the agent through the evaluation episodes. For each episode, the agent receives observations and returns actions until the episode terminates (success, failure, or max steps). The environment collects measurements at each step and computes final episode metrics.

Key considerations:

  • Episodes auto-terminate on success condition, max steps, or explicit stop action
  • Navigation metrics include SPL, SoftSPL, Success Rate, and Distance to Goal
  • Rearrangement metrics include task completion percentage and per-object placement accuracy
  • Video frames are collected during evaluation for later visualization
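For reference, the per-episode navigation metrics named above can be written as short functions. SPL is the success indicator multiplied by the ratio of the shortest-path length to the longer of the agent's path and the shortest path; the SoftSPL variant shown here replaces the binary success with fractional progress toward the goal. The function and parameter names are illustrative assumptions, and this is a sketch of the commonly used definitions rather than Habitat's own measure classes.

```python
def spl(success: bool, shortest_path_length: float, agent_path_length: float) -> float:
    """Success weighted by Path Length for a single episode."""
    if shortest_path_length <= 0:
        raise ValueError("shortest_path_length must be positive")
    # max() caps the ratio at 1 if the agent's measured path is shorter than the reference.
    ratio = shortest_path_length / max(agent_path_length, shortest_path_length)
    return float(success) * ratio


def soft_spl(final_distance_to_goal: float, shortest_path_length: float,
             agent_path_length: float) -> float:
    """SPL with binary success replaced by progress toward the goal."""
    if shortest_path_length <= 0:
        raise ValueError("shortest_path_length must be positive")
    progress = max(0.0, 1.0 - final_distance_to_goal / shortest_path_length)
    ratio = shortest_path_length / max(agent_path_length, shortest_path_length)
    return progress * ratio


# A successful episode that took twice the optimal path length scores 0.5.
example = spl(success=True, shortest_path_length=5.0, agent_path_length=10.0)
```

Because a failed episode contributes zero regardless of path length, SPL rewards reaching the goal first and efficiency second.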

Step 5: Metric Aggregation and Reporting

Aggregate per-episode metrics into summary statistics. Report mean values across all evaluation episodes for each metric. Generate evaluation videos showing agent trajectories with top-down map overlays for navigation tasks.

Key considerations:

  • Metrics are averaged across all evaluation episodes
  • Per-episode breakdowns help identify failure modes
  • Top-down map visualization shows agent path versus optimal path
  • Results should be compared against published baselines for context
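The aggregation step can be sketched as a mean over per-episode records, together with a simple breakdown that lists the failed episodes. The record layout, the episode identifiers, and the metric values below are made up for illustration and are not output from a real evaluation run.

```python
from statistics import mean
from typing import Dict, List


def aggregate(per_episode: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average every metric across all episodes."""
    if not per_episode:
        raise ValueError("no episodes to aggregate")
    metric_names = next(iter(per_episode.values())).keys()
    return {
        name: mean([metrics[name] for metrics in per_episode.values()])
        for name in metric_names
    }


def failed_episodes(per_episode: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the ids of episodes that did not succeed, for failure analysis."""
    return [
        episode_id
        for episode_id, metrics in per_episode.items()
        if metrics["success"] < 1.0
    ]


# Illustrative per-episode results keyed by a hypothetical episode id.
results = {
    "ep0": {"success": 1.0, "spl": 0.5},
    "ep1": {"success": 0.0, "spl": 0.0},
    "ep2": {"success": 1.0, "spl": 1.0},
}

summary = aggregate(results)
failures = failed_episodes(results)
```

Keeping the per-episode records alongside the summary makes it possible to inspect which scenes or goal distances account for most failures.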
