# Principle: Rollout Collection and Training (facebookresearch/habitat-lab)
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-15 02:00 GMT |
## Overview
The core RL training loop that alternates between collecting on-policy rollouts from vectorized environments and updating policy parameters using Proximal Policy Optimization.
## Description
Rollout Collection and Training implements the standard on-policy RL training loop: (1) the policy interacts with vectorized environments to collect a fixed number of steps of experience (rollouts), (2) advantage estimates are computed using Generalized Advantage Estimation (GAE), and (3) the policy is updated using PPO's clipped surrogate objective over multiple epochs of mini-batch updates.
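Step (2) above can be sketched as a backward pass over the rollout. This is a minimal illustrative implementation of GAE, not the habitat-lab API; the function name, argument names, and defaults (`gamma=0.99`, `tau=0.95`) are assumptions for the example.

```python
# Illustrative GAE sketch: rewards, value predictions, and done flags
# are plain Python lists for one rollout; `last_value` bootstraps the
# value beyond the final collected step.
def compute_gae(rewards, values, dones, last_value, gamma=0.99, tau=0.95):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0          # cut credit at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * tau * mask * gae   # discounted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `tau = 0` this reduces to one-step TD advantages; with `tau = 1` it recovers full Monte Carlo returns minus the baseline.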
PPO's clipped objective prevents destructively large policy updates by constraining the ratio of new-to-old action probabilities. Combined with value function clipping and entropy regularization, this produces stable training for high-dimensional observation spaces common in embodied AI.
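The effect of the clipped objective on a single transition can be shown numerically. This is a hedged sketch (the function and parameter names are illustrative, not library code):

```python
# Clipped surrogate objective for one transition.
def clipped_surrogate(new_prob, old_prob, advantage, epsilon=0.2):
    ratio = new_prob / old_prob
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # Taking the min means the objective cannot reward pushing the
    # ratio outside [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
print(clipped_surrogate(0.9, 0.5, 1.0))   # ratio 1.8 is clipped to 1.2 -> 1.2
```

Note the asymmetry: for a negative advantage the `min` keeps the *unclipped* (more pessimistic) term when the ratio shrinks, so the bound is a lower bound on the true objective in both cases.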
## Usage
This is the central training loop for all PPO-based Habitat agents. It applies to PointNav, ObjectNav, and any other task trained with PPO or DD-PPO.
## Theoretical Basis
The PPO clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the GAE advantage estimate, and $\epsilon$ is the clipping parameter.

The full loss combines policy, value, and entropy terms:

$$L(\theta) = \hat{\mathbb{E}}_t\left[-L^{\text{CLIP}}_t(\theta) + c_1\, L^{\text{VF}}_t(\theta) - c_2\, S[\pi_\theta](s_t)\right]$$

where $L^{\text{VF}}_t(\theta) = \big(V_\theta(s_t) - V^{\text{targ}}_t\big)^2$ is the value-function loss, $S[\pi_\theta]$ is the policy entropy, and $c_1$, $c_2$ are the value and entropy coefficients.
Training loop pseudo-code:
```python
# Abstract PPO training loop
while not training_complete:
    # 1. Collect rollouts from the vectorized environments
    for step in range(num_steps):
        action = policy.act(observation)
        observation, reward, done = envs.step(action)
        rollout_buffer.insert(observation, action, reward, done)

    # 2. Compute advantages (GAE)
    advantages = compute_gae(rollout_buffer, gamma, tau)

    # 3. PPO update: several epochs of mini-batch updates over the rollout
    for epoch in range(ppo_epochs):
        for batch in rollout_buffer.batches(num_mini_batch):
            ratio = new_probs / old_probs  # pi_theta / pi_theta_old
            clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
            # Negated because the optimizer minimizes
            policy_loss = -min(ratio * advantages, clipped_ratio * advantages)
            value_loss = mse(predicted_value, returns)
            loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
            optimizer.step(loss)
```
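The pseudo-code above omits the value-function clipping mentioned in the Description. A common formulation constrains the new value estimate to stay near the estimate recorded at rollout time; this sketch is illustrative (names and the shared `epsilon` are assumptions, not habitat-lab's exact implementation):

```python
# Clipped value loss for one state: take the worse (larger) of the
# clipped and unclipped squared errors, so the value head also cannot
# move too far in a single update.
def clipped_value_loss(value, old_value, returns, epsilon=0.2):
    clipped = old_value + max(min(value - old_value, epsilon), -epsilon)
    return 0.5 * max((value - returns) ** 2, (clipped - returns) ** 2)
```

Taking the maximum of the two errors mirrors the pessimism of the policy clip: the update is never allowed to look better than its clipped counterpart.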