Principle: facebookresearch/habitat-lab — High-Level Policy Training
| Knowledge Sources | |
|---|---|
| Domains | Hierarchical_RL, Reinforcement_Learning |
| Last Updated | 2026-02-15 02:00 GMT |
Overview
PPO-based training of a high-level policy that learns to select among pre-trained skills to solve multi-step rearrangement tasks.
Description
High-level Policy Training uses PPO to train only the high-level (meta) policy while keeping all low-level skills frozen. The high-level policy observes the environment state and outputs a skill selection (categorical action over the skill set). Rewards come from the overall task completion, encouraging the high-level policy to learn optimal skill sequencing.
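As a minimal sketch of the idea above (illustrative names only, not habitat-lab's actual classes), a high-level policy maps an observation feature vector to a categorical distribution over the fixed skill set and samples a skill index:

```python
import numpy as np

class HighLevelPolicy:
    """Minimal categorical high-level (meta) policy.

    Illustrative stand-in, not habitat-lab's API: a single linear
    layer maps an observation feature vector to logits over the
    fixed, pre-trained skill set. Only these weights would be
    updated by PPO; the skills themselves stay frozen.
    """

    def __init__(self, obs_dim: int, num_skills: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        # Small random init for the skill-selection head.
        self.W = self.rng.normal(scale=0.01, size=(obs_dim, num_skills))

    def skill_probs(self, obs: np.ndarray) -> np.ndarray:
        """Softmax over skill logits for one observation."""
        logits = obs @ self.W
        z = np.exp(logits - logits.max())  # stable softmax
        return z / z.sum()

    def select_skill(self, obs: np.ndarray) -> int:
        """Sample a skill index from the categorical distribution."""
        p = self.skill_probs(obs)
        return int(self.rng.choice(len(p), p=p))
```

In the real system the observation encoder is a learned network and the output feeds habitat-lab's hierarchical policy machinery; the categorical-action structure is the part this sketch is meant to show.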
The training uses skill-level transitions: a "step" in the high-level MDP corresponds to one complete skill execution, not one environment time step. This temporal abstraction accelerates learning by reducing the effective horizon.
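The skill-level transition described above can be sketched as follows (a simplified rollout loop with hypothetical `env`/`skill` interfaces, not habitat-lab's actual API): each high-level step runs the chosen skill until it terminates, summing the environment rewards into a single high-level reward.

```python
def run_skill(env, skill, obs):
    """Execute `skill` until it signals termination or the episode
    ends; return the final observation, the accumulated reward,
    and the episode-done flag. One call = one high-level step."""
    total_reward = 0.0
    while True:
        obs, reward, done = env.step(skill.act(obs))
        total_reward += reward
        if done or skill.is_terminated(obs):
            return obs, total_reward, done

def collect_hl_rollout(env, hl_policy, skills, obs, max_hl_steps):
    """Collect (skill_id, reward) transitions in the high-level MDP.

    Many environment steps collapse into each tuple, which is the
    temporal abstraction that shortens the effective horizon.
    """
    transitions = []
    for _ in range(max_hl_steps):
        skill_id = hl_policy(obs)
        obs, reward, done = run_skill(env, skills[skill_id], obs)
        transitions.append((skill_id, reward))
        if done:
            break
    return transitions
```

PPO then treats each collected tuple as one transition, so a task spanning thousands of environment steps yields only tens of high-level training samples.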
Usage
Use after all low-level skills have been trained and assembled into a hierarchical policy. Only the high-level policy parameters are updated during this phase.
Theoretical Basis
The high-level PPO operates at the option (skill) level:
- Skill-level transitions: Each HL step spans the duration of one skill execution
- Skill-level rewards: Accumulated task reward over the skill's execution
- Temporal abstraction: Reduces effective horizon from thousands of env steps to tens of skill steps
- Frozen skills: Only HL policy gradients flow; skill weights are fixed
The high-level reward for each decision is

$$R_k = \sum_{t=t_k}^{t_{k+1}-1} r_t$$

where $R_k$ is the total reward accumulated during the $k$-th skill execution, and $t_k$, $t_{k+1}$ are the environment time steps at which the $k$-th skill begins and the next skill takes over.
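The skill-level PPO objective applied to these transitions can be sketched as a standard clipped surrogate over the high-level action (the skill choice); the function below is a scalar, single-transition version for illustration, not habitat-lab's implementation:

```python
import math

def ppo_hl_loss(logp_new: float, logp_old: float,
                advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped PPO surrogate loss for one high-level transition.

    `logp_new`/`logp_old` are log-probabilities of the chosen skill
    under the current and rollout-time high-level policies. Only
    high-level parameters receive this gradient; the frozen skills
    never appear in the objective.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Negate the clipped surrogate: minimizing this loss maximizes
    # min(ratio * A, clip(ratio) * A).
    return -min(ratio * advantage, clipped * advantage)
```

In practice the advantage is estimated (e.g. with GAE) over the skill-level rewards $R_k$, and the loss is averaged over a batch of high-level transitions before the gradient step.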