# Principle: Rollout Collection and Training (facebookresearch/habitat-lab)
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-15 02:00 GMT |
## Overview
The core RL training loop that alternates between collecting on-policy rollouts from vectorized environments and updating policy parameters using Proximal Policy Optimization.
## Description
Rollout Collection and Training implements the standard on-policy RL training loop: (1) the policy interacts with vectorized environments to collect a fixed number of steps of experience (rollouts), (2) advantage estimates are computed using Generalized Advantage Estimation (GAE), and (3) the policy is updated using PPO's clipped surrogate objective over multiple epochs of mini-batch updates.
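Step (2) above can be sketched as a backward pass over the rollout. This is a minimal illustrative implementation of GAE, not the habitat-lab API; the function name, argument names, and defaults (`gamma=0.99`, `tau=0.95`) are assumptions for the example.

```python
# Illustrative GAE sketch: rewards, value predictions, and done flags
# are plain Python lists for one rollout; `last_value` bootstraps the
# value beyond the final collected step.
def compute_gae(rewards, values, dones, last_value, gamma=0.99, tau=0.95):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0          # cut credit at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * tau * mask * gae   # discounted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `tau = 0` this reduces to one-step TD advantages; with `tau = 1` it recovers full Monte Carlo returns minus the baseline.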
PPO's clipped objective prevents destructively large policy updates by constraining the ratio of new-to-old action probabilities. Combined with value function clipping and entropy regularization, this produces stable training for high-dimensional observation spaces common in embodied AI.
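The effect of the clipped objective on a single transition can be shown numerically. This is a hedged sketch (the function and parameter names are illustrative, not library code):

```python
# Clipped surrogate objective for one transition.
def clipped_surrogate(new_prob, old_prob, advantage, epsilon=0.2):
    ratio = new_prob / old_prob
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # Taking the min means the objective cannot reward pushing the
    # ratio outside [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
print(clipped_surrogate(0.9, 0.5, 1.0))   # ratio 1.8 is clipped to 1.2 -> 1.2
```

Note the asymmetry: for a negative advantage the `min` keeps the *unclipped* (more pessimistic) term when the ratio shrinks, so the bound is a lower bound on the true objective in both cases.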
## Usage
This is the central training loop for all PPO-based Habitat agents. It applies to PointNav, ObjectNav, and any other task trained with PPO or DD-PPO.
## Theoretical Basis
The PPO clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the GAE advantage estimate, and $\epsilon$ is the clipping parameter.

The full loss combines policy, value, and entropy terms:

$$L(\theta) = \hat{\mathbb{E}}_t\left[-L^{\text{CLIP}}_t(\theta) + c_1\, L^{\text{VF}}_t(\theta) - c_2\, S[\pi_\theta](s_t)\right]$$

where $L^{\text{VF}}_t(\theta) = \big(V_\theta(s_t) - V^{\text{targ}}_t\big)^2$ is the value-function loss, $S[\pi_\theta]$ is the policy entropy, and $c_1$, $c_2$ are the value and entropy coefficients.
Training loop pseudo-code:
```python
# Abstract PPO training loop
while not training_complete:
    # 1. Collect rollouts from the vectorized environments
    for step in range(num_steps):
        action = policy.act(observation)
        observation, reward, done = envs.step(action)
        rollout_buffer.insert(observation, action, reward, done)

    # 2. Compute advantages (GAE)
    advantages = compute_gae(rollout_buffer, gamma, tau)

    # 3. PPO update: several epochs of mini-batch updates over the rollout
    for epoch in range(ppo_epochs):
        for batch in rollout_buffer.batches(num_mini_batch):
            ratio = new_probs / old_probs  # pi_theta / pi_theta_old
            clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
            # Negated because the optimizer minimizes
            policy_loss = -min(ratio * advantages, clipped_ratio * advantages)
            value_loss = mse(predicted_value, returns)
            loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
            optimizer.step(loss)
```
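The pseudo-code above omits the value-function clipping mentioned in the Description. A common formulation constrains the new value estimate to stay near the estimate recorded at rollout time; this sketch is illustrative (names and the shared `epsilon` are assumptions, not habitat-lab's exact implementation):

```python
# Clipped value loss for one state: take the worse (larger) of the
# clipped and unclipped squared errors, so the value head also cannot
# move too far in a single update.
def clipped_value_loss(value, old_value, returns, epsilon=0.2):
    clipped = old_value + max(min(value - old_value, epsilon), -epsilon)
    return 0.5 * max((value - returns) ** 2, (clipped - returns) ** 2)
```

Taking the maximum of the two errors mirrors the pessimism of the policy clip: the update is never allowed to look better than its clipped counterpart.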