
Principle: facebookresearch/habitat-lab High-level Policy Training

From Leeroopedia
Knowledge Sources
Domains Hierarchical_RL, Reinforcement_Learning
Last Updated 2026-02-15 02:00 GMT

Overview

PPO-based training of a high-level policy that learns to select among pre-trained skills to solve multi-step rearrangement tasks.

Description

High-level Policy Training uses PPO to train only the high-level (meta) policy while keeping all low-level skills frozen. The high-level policy observes the environment state and outputs a skill selection (categorical action over the skill set). Rewards come from the overall task completion, encouraging the high-level policy to learn optimal skill sequencing.
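The categorical skill selection can be sketched as a softmax over per-skill logits followed by sampling. This is a minimal illustration, not habitat-lab's actual policy class; the logits vector and skill names are assumptions.

```python
import math
import random

def select_skill(logits, rng=None):
    """Sample a skill index from a categorical distribution over skill logits.

    `logits` is a hypothetical vector of unnormalized scores, one per
    pre-trained skill (e.g. pick, place, navigate), as might be produced
    by the high-level policy network.
    """
    rng = rng or random.Random(0)
    # Numerically stable softmax over the skill set
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample one skill index according to the categorical probabilities
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs
```

During PPO training, the log-probability of the sampled skill index plays the same role as the log-probability of a primitive action in flat PPO.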

The training uses skill-level transitions: a "step" in the high-level MDP corresponds to one complete skill execution, not one environment time step. This temporal abstraction accelerates learning by reducing the effective horizon.
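One high-level step can be sketched as running the chosen skill to termination and accumulating the environment reward along the way. The `env_step` and `skill_policy` callables below are hypothetical stand-ins for the simulator and a frozen low-level skill, not habitat-lab's actual interfaces.

```python
def high_level_step(env_step, skill_policy, obs, gamma=0.99, max_steps=200):
    """Execute one step of the high-level MDP.

    Runs the frozen low-level skill until it signals termination (or a step
    cap is hit), accumulating discounted environment reward. The returned
    (total_reward, final_obs) pair forms a single skill-level transition.
    """
    total_reward, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = skill_policy(obs)                  # frozen skill chooses the env action
        obs, reward, skill_done = env_step(action)  # one environment time step
        total_reward += discount * reward           # accumulate skill-level reward
        discount *= gamma
        if skill_done:                              # skill signals it has finished
            break
    return total_reward, obs
```

Because many environment steps collapse into one high-level transition, the PPO rollout buffer for the high-level policy stays short even for long-horizon episodes.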

Usage

Use this phase after all low-level skills have been trained and assembled into a hierarchical policy. Only the high-level policy parameters are updated; the low-level skill weights remain frozen throughout.
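Freezing the skills while training the high-level policy amounts to applying gradient updates only to the high-level parameters. The sketch below illustrates this with a plain dictionary of named parameters; the `high_level.` / `skill.` name prefixes are illustrative, not habitat-lab's parameter names.

```python
def sgd_update(params, grads, lr=1e-3, trainable_prefix="high_level."):
    """Apply a gradient step only to high-level parameters.

    `params` and `grads` are hypothetical flat name->value dicts. Parameters
    whose names start with `trainable_prefix` are updated; everything else
    (the frozen skill weights) is returned unchanged.
    """
    return {
        name: (value - lr * grads.get(name, 0.0)
               if name.startswith(trainable_prefix)  # update HL policy only
               else value)                           # skill weights stay fixed
        for name, value in params.items()
    }
```

In a deep learning framework the same effect is typically achieved by disabling gradients on the skill modules and passing only the high-level parameters to the optimizer.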

Theoretical Basis

The high-level PPO operates at the option (skill) level:

  1. Skill-level transitions: Each HL step spans the duration of one skill execution
  2. Skill-level rewards: Accumulated task reward over the skill's execution
  3. Temporal abstraction: Reduces effective horizon from thousands of env steps to tens of skill steps
  4. Frozen skills: Only HL policy gradients flow; skill weights are fixed

$J_{\mathrm{HL}}(\theta) = \mathbb{E}\left[\sum_{k=0}^{K} \gamma^{k} R_k^{\mathrm{skill}}\right]$

where $R_k^{\mathrm{skill}}$ is the total reward accumulated during the $k$-th skill execution.
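The objective above can be estimated per episode by discounting over skill indices rather than environment time steps. A minimal Monte Carlo sketch:

```python
def skill_level_return(skill_rewards, gamma=0.99):
    """Discounted return over one episode of the high-level MDP.

    `skill_rewards` lists R_k^skill, the total reward accumulated during
    the k-th skill execution. The discount gamma is applied per skill
    step, reflecting the temporal abstraction: the effective horizon is
    tens of skill steps rather than thousands of environment steps.
    """
    return sum(gamma ** k * r for k, r in enumerate(skill_rewards))
```

Averaging this quantity over many rollouts gives an empirical estimate of $J_{\mathrm{HL}}(\theta)$ for the current high-level policy.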

