Workflow: facebookresearch/habitat-lab PointNav PPO Training

From Leeroopedia
Domains Embodied_AI, Reinforcement_Learning, Navigation
Last Updated 2026-02-15 02:00 GMT

Overview

End-to-end process for training a PointNav agent using Proximal Policy Optimization (PPO) or Decentralized Distributed PPO (DD-PPO) in Habitat-Lab's simulated indoor environments.

Description

This workflow covers the standard procedure for training a reinforcement learning agent to navigate from a starting position to a target coordinate (PointGoal Navigation) within photorealistic 3D indoor scenes. It leverages PPO for single-GPU setups or DD-PPO for distributed multi-GPU training, with a ResNet-based visual encoder processing RGB/depth sensor observations and an RNN state encoder maintaining temporal context. The process covers environment setup, configuration composition via Hydra, policy network initialization, distributed training with gradient synchronization, checkpointing, and evaluation with standard navigation metrics (SPL, Success Rate, Distance to Goal).

Usage

Execute this workflow when you have a PointNav episode dataset (e.g., Gibson, HM3D, or Matterport3D scenes) and need to train an embodied agent to navigate to specified coordinates using visual observations. This is the foundational training pipeline in Habitat-Lab and is typically the first workflow new users encounter.

Execution Steps

Step 1: Environment Setup

Install Habitat-Sim via conda with physics support, then pip-install the habitat-lab and habitat-baselines packages. Download scene datasets (e.g., Gibson, HM3D) and the corresponding PointNav episode datasets using the habitat-sim data-download utility. Verify the installation by running the example script.

Key considerations:

  • Habitat-Sim must be installed with Bullet physics support for realistic interactions
  • Scene data must be placed under the expected `data/` directory structure
  • Episode datasets define start positions and goal coordinates for training
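A quick way to catch a misplaced download is to check the `data/` layout before launching training. The sketch below is illustrative: the two directory names follow common Habitat-Lab conventions, but the exact subpaths depend on which dataset and version you download, and `missing_data_dirs` is a hypothetical helper, not part of the library.

```python
from pathlib import Path

# Illustrative check of the data/ layout Habitat-Lab conventionally expects;
# exact subpaths depend on the dataset and version you download.
EXPECTED = [
    "data/scene_datasets",       # e.g. data/scene_datasets/gibson/<scene>.glb
    "data/datasets/pointnav",    # e.g. data/datasets/pointnav/gibson/v1/train/
]

def missing_data_dirs(root="."):
    """Return the expected data directories that are not present under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).is_dir()]

if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data layout looks OK")
```

Running this from the repository root before training surfaces path problems early, since a missing episode dataset otherwise fails only at environment construction time.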

Step 2: Configuration Composition

Select or compose a Hydra configuration file that defines the task (PointNav), dataset, simulator settings, sensor setup (RGB, depth, or RGBD), and training hyperparameters. The configuration system uses structured dataclasses for type safety and YAML composition for experiment-specific overrides.

Key considerations:

  • Use `ppo_pointnav_example.yaml` for quick testing or `ppo_pointnav.yaml` for production training
  • DD-PPO configs (`ddppo_pointnav.yaml`) add distributed training parameters
  • Sensor setups are composable via Hydra defaults (e.g., `rgbd_agent`, `depth_agent`)
  • Override hyperparameters via command line using Hydra syntax
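Conceptually, a Hydra command-line override rewrites one leaf of the composed nested config. The sketch below mimics that behavior on a plain dict so the mechanics are visible; it is not Hydra itself, which additionally type-checks values against the structured dataclasses and composes YAML defaults. The config keys shown are assumptions for illustration.

```python
# Conceptual sketch of what a Hydra-style dotted override such as
#   habitat_baselines.rl.ppo.lr=2.5e-4
# does: walk the nested config and replace one leaf value.

def apply_override(cfg: dict, override: str) -> None:
    """Apply a single 'a.b.c=value' override to a nested dict in place."""
    dotted, _, raw = override.partition("=")
    *path, leaf = dotted.split(".")
    node = cfg
    for key in path:
        node = node.setdefault(key, {})
    try:  # crude scalar coercion; Hydra uses the dataclass field types instead
        value = float(raw) if "." in raw or "e" in raw.lower() else int(raw)
    except ValueError:
        value = raw  # leave non-numeric strings as-is
    node[leaf] = value

cfg = {"habitat_baselines": {"rl": {"ppo": {"lr": 2.5e-4, "clip_param": 0.2}}}}
apply_override(cfg, "habitat_baselines.rl.ppo.lr=1e-4")
print(cfg["habitat_baselines"]["rl"]["ppo"]["lr"])  # 0.0001
```

The real override syntax on the command line looks the same (`key.path=value` appended to the training command), which is why dotted paths in the YAML map directly to runtime attributes.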

Step 3: Policy Network Initialization

The trainer instantiates the policy network consisting of a visual encoder (ResNet or SimpleCNN), an RNN state encoder (GRU/LSTM) for temporal context, and an action distribution head. The observation space is determined from the environment specification, and observation transforms (resize, center-crop) are applied.

Key considerations:

  • ResNet encoder provides stronger visual features but requires more memory
  • The RNN state encoder maintains belief state across timesteps
  • Observation transforms must match between training and evaluation
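The three components described above compose as encoder → RNN → heads. The PyTorch sketch below is a simplified stand-in for the actor-critic used in habitat-baselines: layer sizes, the single depth channel, and the class name are illustrative choices, not the library's defaults.

```python
import torch
import torch.nn as nn

# Simplified stand-in for the PointNav actor-critic: a small CNN visual
# encoder, a GRU state encoder for temporal context, and action/value heads.
# Layer sizes here are illustrative, not Habitat-Lab's defaults.

class TinyPointNavPolicy(nn.Module):
    def __init__(self, num_actions=4, hidden_size=512):
        super().__init__()
        self.encoder = nn.Sequential(           # stands in for SimpleCNN/ResNet
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden_size), nn.ReLU(),
        )
        self.state_encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.action_head = nn.Linear(hidden_size, num_actions)  # policy logits
        self.value_head = nn.Linear(hidden_size, 1)             # critic

    def forward(self, depth, hidden):
        feats = self.encoder(depth).unsqueeze(1)   # [B, 1, H]: one timestep
        out, hidden = self.state_encoder(feats, hidden)
        out = out.squeeze(1)
        return self.action_head(out), self.value_head(out), hidden

policy = TinyPointNavPolicy()
depth = torch.zeros(2, 1, 128, 128)   # batch of 2 single-channel depth frames
h0 = torch.zeros(1, 2, 512)           # initial GRU hidden state
logits, value, h1 = policy(depth, h0)
print(logits.shape, value.shape)      # torch.Size([2, 4]) torch.Size([2, 1])
```

Passing `h1` back in on the next step is what gives the agent memory across timesteps; at episode boundaries the hidden state is reset to zeros.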

Step 4: Distributed Process Setup

For DD-PPO training, initialize the distributed process group. Each GPU worker runs independent simulation environments. On SLURM clusters, the launcher script handles multi-node coordination. For single-node multi-GPU, use the provided shell script with torch.distributed.launch.

Key considerations:

  • DD-PPO uses decentralized gradient synchronization (allreduce) without a parameter server
  • Each worker independently collects rollouts before synchronizing gradients
  • SLURM launcher handles node discovery and process group initialization automatically
  • Single-GPU training skips this step entirely
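The decentralized synchronization described above reduces to an allreduce over per-worker gradients. The sketch below demonstrates that primitive with a single-process "gloo" group so it runs anywhere; in real DD-PPO each GPU worker participates with its own rank, and the framework wires the allreduce into the backward pass rather than calling it by hand.

```python
import os
import torch
import torch.distributed as dist

# The allreduce primitive underlying DD-PPO's gradient sync, shown with a
# single-process "gloo" group (world_size=1) so the example is runnable
# without multiple GPUs. Real workers each pass their own rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grad = torch.tensor([1.0, 2.0, 3.0])       # stand-in for a parameter gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sum across all workers
grad /= dist.get_world_size()                # sum -> mean

print(grad)  # unchanged when world_size == 1
dist.destroy_process_group()
```

Because every worker computes the same mean, all replicas stay in lockstep without any parameter server, which is the "decentralized" part of DD-PPO.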

Step 5: Rollout Collection and Training

The training loop alternates between collecting experience rollouts (agent-environment interactions stored in rollout storage) and performing PPO policy updates. Each rollout collects a fixed number of environment steps, computes advantages using Generalized Advantage Estimation (GAE), and performs multiple epochs of minibatch gradient updates on the clipped surrogate objective.

Key considerations:

  • Rollout storage buffers observations, actions, rewards, and value predictions
  • GAE lambda and discount factor control the bias-variance tradeoff
  • PPO clip ratio constrains policy updates for stability
  • Training metrics are logged to TensorBoard or Weights & Biases
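The two core computations in the update loop can be sketched compactly: GAE walks the rollout backwards accumulating discounted TD errors, and the clipped surrogate takes the pessimistic minimum of the unclipped and clipped policy-ratio terms. The hyperparameter values below are common defaults, not necessarily Habitat-Lab's.

```python
import numpy as np

# Sketch of the PPO update's two core computations. Hyperparameter values
# (gamma, lam, clip) are common defaults, not necessarily Habitat-Lab's.

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single rollout."""
    values = np.append(values, last_value)     # bootstrap with the final value
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae        # exponentially weighted sum
        advantages[t] = gae
    return advantages

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO's pessimistic clipped objective for one sample (to be maximized)."""
    return min(ratio * advantage,
               np.clip(ratio, 1 - clip, 1 + clip) * advantage)

adv = compute_gae(rewards=np.ones(5), values=np.zeros(5), last_value=0.0)
print(adv.round(3))  # largest at t=0, decaying toward 1.0 at the last step
```

Lowering `lam` toward 0 shortens the advantage horizon (less variance, more bias), while `clip` bounds how far the new policy's action probabilities can move per update.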

Step 6: Checkpointing and Evaluation

Periodically save model checkpoints during training. For evaluation, load a trained checkpoint and run the agent through held-out episodes, collecting navigation metrics including Success Rate, SPL (Success weighted by Path Length), SoftSPL, and Distance to Goal. Evaluation can also generate video recordings of agent trajectories.

Key considerations:

  • Checkpoints include model weights, optimizer state, and training progress
  • Resume training from interruptions using the resume state mechanism
  • Evaluation uses greedy action selection (no exploration noise)
  • Standard metrics follow the Habitat Challenge evaluation protocol
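The headline metrics have simple closed forms: an episode succeeds if the agent stops within a radius of the goal, and SPL weights that success by the ratio of shortest-path length to the path actually taken. The functions below follow the standard definitions; the 0.2 m success radius is an illustrative value rather than a quoted Habitat-Lab default.

```python
# Standard PointNav evaluation metrics in closed form. The 0.2 m success
# radius is an illustrative value; check your task config for the real one.

def success(distance_to_goal: float, radius: float = 0.2) -> float:
    """1.0 if the agent stopped within the success radius of the goal."""
    return 1.0 if distance_to_goal <= radius else 0.0

def spl(succeeded: float, shortest_path: float, agent_path: float) -> float:
    """Success weighted by (inverse normalized) Path Length."""
    return succeeded * shortest_path / max(agent_path, shortest_path)

# An episode that succeeds but takes a 50% longer route scores SPL ~0.667.
print(spl(success(0.1), shortest_path=10.0, agent_path=15.0))
```

Note that SPL is zero for any failed episode regardless of path efficiency, which is why SoftSPL (which replaces the binary success term with a distance-based one) is often reported alongside it.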
