Principle: facebookresearch/habitat-lab — High-Level Policy Training
| Knowledge Sources | |
|---|---|
| Domains | Hierarchical_RL, Reinforcement_Learning |
| Last Updated | 2026-02-15 02:00 GMT |
Overview
PPO-based training of a high-level policy that learns to select among pre-trained skills to solve multi-step rearrangement tasks.
Description
High-level Policy Training uses PPO to train only the high-level (meta) policy while keeping all low-level skills frozen. The high-level policy observes the environment state and outputs a skill selection (categorical action over the skill set). Rewards come from the overall task completion, encouraging the high-level policy to learn optimal skill sequencing.
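As a minimal sketch of the idea above (illustrative names only, not habitat-lab's actual classes), a high-level policy maps an observation feature vector to a categorical distribution over the fixed skill set and samples a skill index:

```python
import numpy as np

class HighLevelPolicy:
    """Minimal categorical high-level (meta) policy.

    Illustrative stand-in, not habitat-lab's API: a single linear
    layer maps an observation feature vector to logits over the
    fixed, pre-trained skill set. Only these weights would be
    updated by PPO; the skills themselves stay frozen.
    """

    def __init__(self, obs_dim: int, num_skills: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        # Small random init for the skill-selection head.
        self.W = self.rng.normal(scale=0.01, size=(obs_dim, num_skills))

    def skill_probs(self, obs: np.ndarray) -> np.ndarray:
        """Softmax over skill logits for one observation."""
        logits = obs @ self.W
        z = np.exp(logits - logits.max())  # stable softmax
        return z / z.sum()

    def select_skill(self, obs: np.ndarray) -> int:
        """Sample a skill index from the categorical distribution."""
        p = self.skill_probs(obs)
        return int(self.rng.choice(len(p), p=p))
```

In the real system the observation encoder is a learned network and the output feeds habitat-lab's hierarchical policy machinery; the categorical-action structure is the part this sketch is meant to show.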
The training uses skill-level transitions: a "step" in the high-level MDP corresponds to one complete skill execution, not one environment time step. This temporal abstraction accelerates learning by reducing the effective horizon.
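The skill-level transition described above can be sketched as follows (a simplified rollout loop with hypothetical `env`/`skill` interfaces, not habitat-lab's actual API): each high-level step runs the chosen skill until it terminates, summing the environment rewards into a single high-level reward.

```python
def run_skill(env, skill, obs):
    """Execute `skill` until it signals termination or the episode
    ends; return the final observation, the accumulated reward,
    and the episode-done flag. One call = one high-level step."""
    total_reward = 0.0
    while True:
        obs, reward, done = env.step(skill.act(obs))
        total_reward += reward
        if done or skill.is_terminated(obs):
            return obs, total_reward, done

def collect_hl_rollout(env, hl_policy, skills, obs, max_hl_steps):
    """Collect (skill_id, reward) transitions in the high-level MDP.

    Many environment steps collapse into each tuple, which is the
    temporal abstraction that shortens the effective horizon.
    """
    transitions = []
    for _ in range(max_hl_steps):
        skill_id = hl_policy(obs)
        obs, reward, done = run_skill(env, skills[skill_id], obs)
        transitions.append((skill_id, reward))
        if done:
            break
    return transitions
```

PPO then treats each collected tuple as one transition, so a task spanning thousands of environment steps yields only tens of high-level training samples.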
Usage
Use after all low-level skills have been trained and assembled into a hierarchical policy. Only the high-level policy parameters are updated during this phase.
Theoretical Basis
The high-level PPO operates at the option (skill) level:
- Skill-level transitions: Each HL step spans the duration of one skill execution
- Skill-level rewards: Accumulated task reward over the skill's execution
- Temporal abstraction: Reduces effective horizon from thousands of env steps to tens of skill steps
- Frozen skills: Only HL policy gradients flow; skill weights are fixed
The high-level reward for each decision is

$$R_k = \sum_{t=t_k}^{t_{k+1}-1} r_t$$

where $R_k$ is the total reward accumulated during the $k$-th skill execution, and $t_k$, $t_{k+1}$ are the environment time steps at which the $k$-th skill begins and the next skill takes over.
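The skill-level PPO objective applied to these transitions can be sketched as a standard clipped surrogate over the high-level action (the skill choice); the function below is a scalar, single-transition version for illustration, not habitat-lab's implementation:

```python
import math

def ppo_hl_loss(logp_new: float, logp_old: float,
                advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped PPO surrogate loss for one high-level transition.

    `logp_new`/`logp_old` are log-probabilities of the chosen skill
    under the current and rollout-time high-level policies. Only
    high-level parameters receive this gradient; the frozen skills
    never appear in the objective.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Negate the clipped surrogate: minimizing this loss maximizes
    # min(ratio * A, clip(ratio) * A).
    return -min(ratio * advantage, clipped * advantage)
```

In practice the advantage is estimated (e.g. with GAE) over the skill-level rewards $R_k$, and the loss is averaged over a batch of high-level transitions before the gradient step.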