
Principle:Facebookresearch Habitat lab Rollout Collection and Training

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Optimization
Last Updated 2026-02-15 02:00 GMT

Overview

The core RL training loop that alternates between collecting on-policy rollouts from vectorized environments and updating policy parameters using Proximal Policy Optimization.

Description

Rollout Collection and Training implements the standard on-policy RL training loop: (1) the policy interacts with vectorized environments to collect a fixed number of steps of experience (rollouts), (2) advantage estimates are computed using Generalized Advantage Estimation (GAE), and (3) the policy is updated using PPO's clipped surrogate objective over multiple epochs of mini-batch updates.

PPO's clipped objective prevents destructively large policy updates by constraining the ratio of new-to-old action probabilities. Combined with value function clipping and entropy regularization, this produces stable training for high-dimensional observation spaces common in embodied AI.
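The clipping behavior can be seen numerically in a minimal sketch (the function name `ppo_policy_loss` and the sample values are illustrative, not from Habitat-lab): once the probability ratio moves outside $[1-\epsilon, 1+\epsilon]$ in the direction the advantage favors, the objective stops improving, so the gradient incentive to push further vanishes.

```python
import numpy as np

def ppo_policy_loss(ratio, advantage, epsilon=0.2):
    """Per-sample negated clipped surrogate: -min(r*A, clip(r)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    return -np.minimum(unclipped, clipped)

# With positive advantage, ratios beyond 1 + epsilon earn no extra credit:
ratios = np.array([1.0, 1.5, 3.0])
advantages = np.array([1.0, 1.0, 1.0])
print(ppo_policy_loss(ratios, advantages))  # [-1.  -1.2 -1.2]
```

The loss plateaus at the clip boundary (here $1.2$), which is what keeps a single batch of stale data from dragging the policy arbitrarily far from the one that collected it.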

Usage

This is the central training loop for all PPO-based Habitat agents. It applies to PointNav, ObjectNav, and any other task trained with PPO or DD-PPO.

Theoretical Basis

The PPO clipped surrogate objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\right)\right]$$

Where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$

The full loss combines policy, value, and entropy terms:

$$L = L^{CLIP} - c_1 L^{VF} + c_2 H[\pi]$$
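Putting the pieces together, a hedged sketch of the combined loss in NumPy (function name, argument layout, and default coefficients are assumptions for illustration; in practice the ratio is computed from stored log-probabilities, and entropy comes from the action distribution):

```python
import numpy as np

def ppo_total_loss(new_log_prob, old_log_prob, advantage, value_pred, returns,
                   entropy, epsilon=0.2, c1=0.5, c2=0.01):
    """Batch-averaged PPO loss to minimize: -(L^CLIP - c1 * L^VF + c2 * H)."""
    ratio = np.exp(new_log_prob - old_log_prob)           # r_t(theta)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    policy_loss = -np.minimum(unclipped, clipped).mean()  # -L^CLIP
    value_loss = ((value_pred - returns) ** 2).mean()     # L^VF
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```

Note the sign flip relative to the objective: the objective is maximized, so the loss passed to the optimizer negates $L^{CLIP}$ and the entropy bonus while keeping the value-error term positive.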

Training loop pseudo-code:

# Abstract PPO training loop
while not converged:
    # 1. Collect rollouts from the vectorized environments
    for step in range(num_steps):
        value, action, log_prob = policy.act(observation)
        observation, reward, done = envs.step(action)
        rollout_buffer.insert(observation, action, log_prob, value, reward, done)

    # 2. Compute advantages (GAE); `done` flags stop bootstrapping across episodes
    advantages = compute_gae(rollout_buffer, gamma, tau)

    # 3. PPO update: multiple epochs of mini-batch gradient steps
    for epoch in range(ppo_epochs):
        for batch in rollout_buffer.batches(num_mini_batch):
            ratio = exp(new_log_probs - old_log_probs)
            clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
            policy_loss = -min(ratio * advantages, clipped_ratio * advantages)
            value_loss = mse(predicted_values, returns)
            loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
            optimizer.step(loss)
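The `compute_gae` step above can be made concrete with a standalone sketch (a hypothetical helper, not the Habitat-lab implementation; it assumes `values` carries one extra bootstrap entry for the state after the final step):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, tau=0.95):
    """GAE over one rollout, swept backwards through time.

    rewards, dones: length-T arrays; values: length T+1 (last entry bootstraps).
    Returns (advantages, returns) where returns = advantages + values[:-1].
    """
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # zero the recursion across episode boundaries
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * tau * mask * gae
        advantages[t] = gae
    return advantages, advantages + values[:-1]
```

With `gamma = tau = 1` and zero value estimates, the advantages reduce to reward-to-go sums, which is a quick sanity check for the recursion.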

Related Pages

Implemented By

Uses Heuristic
