Principle:Isaac sim IsaacGymEnvs Hierarchical Reinforcement Learning
| Knowledge Sources | |
|---|---|
| Domains | Hierarchical_Reinforcement_Learning, Motion_Control |
| Last Updated | 2026-02-15 11:00 GMT |
Overview
Hierarchical Reinforcement Learning (HRL) decomposes control into multiple levels where a high-level controller selects latent goals or actions and a pre-trained low-level controller (LLC) translates them into joint-level motor commands, enabling temporal abstraction and reuse of learned motor skills.
Description
Standard (flat) reinforcement learning requires a single policy to map observations directly to low-level actions at every simulation timestep. For complex motor tasks -- such as humanoid locomotion, object manipulation, or multi-stage behaviors -- this flat approach faces challenges including long time horizons, sparse rewards, and an enormous action space. Hierarchical Reinforcement Learning addresses these challenges by decomposing the control problem into at least two levels: a high-level policy that reasons about abstract goals, and a low-level controller (LLC) that executes concrete motor commands to achieve those goals.
The low-level controller is typically pre-trained on a repertoire of motor skills, often using techniques such as Adversarial Motion Priors (AMP) or direct motion tracking. Once trained, the LLC is frozen: its weights are no longer updated. The LLC takes as input the current proprioceptive state of the agent along with a latent code (a compact vector in a learned latent space) that specifies what behavior to produce, and it outputs joint torques or position targets that realize that behavior. This pre-training phase ensures the LLC can produce a diverse set of natural, physically plausible movements.
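The LLC interface described above (proprioceptive state plus latent code in, joint-level action out, weights fixed) can be illustrated with a toy example. This is only a sketch under stated assumptions, not the IsaacGymEnvs implementation: `FrozenLLC`, its random linear weights, and all dimensions are hypothetical stand-ins for a pre-trained neural network.

```python
import math
import random


class FrozenLLC:
    """Toy stand-in for a pre-trained LLC: a fixed linear map followed by tanh.

    Hypothetical example; a real LLC would be a trained neural network.
    """

    def __init__(self, state_dim, latent_dim, num_joints, seed=0):
        rng = random.Random(seed)
        self.state_dim = state_dim
        self.latent_dim = latent_dim
        in_dim = state_dim + latent_dim
        # Weights are stored as nested tuples so they cannot be modified in place.
        self.weights = tuple(
            tuple(rng.uniform(-1.0, 1.0) for _ in range(in_dim))
            for _ in range(num_joints)
        )

    def get_action(self, proprio_state, latent_code):
        if len(proprio_state) != self.state_dim or len(latent_code) != self.latent_dim:
            raise ValueError("unexpected input dimension")
        # The LLC input is the proprioceptive state concatenated with the latent code.
        x = list(proprio_state) + list(latent_code)
        # tanh keeps every joint target inside [-1, 1].
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.weights]


llc = FrozenLLC(state_dim=4, latent_dim=2, num_joints=3)
state = [0.1, -0.2, 0.3, 0.0]
action_a = llc.get_action(state, [1.0, 0.0])  # same state, latent code A
action_b = llc.get_action(state, [0.0, 1.0])  # same state, latent code B
```

With the state held fixed, changing only the latent code changes the joint-level output, which is the sense in which the latent code "specifies what behavior to produce".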
The high-level policy operates at a coarser temporal resolution: it selects a new latent code every k simulation steps rather than every step. This temporal abstraction shortens the high-level decision horizon by a factor of k (an episode of T simulation steps requires only about T/k high-level decisions), which simplifies the credit assignment problem. The high-level policy receives task-relevant observations (e.g., target location, object state, terrain information) and outputs a latent code that is held constant for k steps while the LLC executes it. The high-level policy is trained with reinforcement learning using the task reward, but it never needs to reason about individual joint movements. This separation of concerns enables skill reuse: the same pre-trained LLC can be paired with different high-level policies for different tasks, and the high-level policy can focus purely on what to do rather than how to do it.
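The effect of holding a latent code for k steps can be made concrete with a small sketch. The function names `latent_schedule` and `num_high_level_decisions` are hypothetical and exist only for this illustration.

```python
def latent_schedule(num_steps, k):
    """Return, for each simulation step, the index of the high-level decision in effect."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [t // k for t in range(num_steps)]


def num_high_level_decisions(num_steps, k):
    """Number of latent codes selected in an episode of num_steps steps (ceiling division)."""
    return -(-num_steps // k)


schedule = latent_schedule(num_steps=10, k=4)
print(schedule)                          # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
print(num_high_level_decisions(10, 4))   # 3
print(num_high_level_decisions(300, 5))  # 60
```

In the last line, an episode of 300 simulation steps with k = 5 requires only 60 high-level decisions, so the high-level policy faces a horizon five times shorter than a flat policy would.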
Usage
Use Hierarchical Reinforcement Learning when the task requires complex, multi-step behavior that combines locomotion with task-specific objectives. HRL is particularly effective when you have access to a pre-trained low-level motor skill controller (e.g., from AMP training) and want to leverage it for a new task without retraining from scratch. It is also valuable when the action space is high-dimensional (many DOFs) and the task reward is sparse, as the temporal abstraction provided by the hierarchy helps bridge the gap between actions and outcomes.
Theoretical Basis
Temporal abstraction:
High-level policy acts every k steps: z_t = pi_high(o_t) for t = 0, k, 2k, ...
Low-level controller acts every step: a_t = pi_low(s_t, z_{k*floor(t/k)}) for all t, i.e., it uses the most recently selected latent code
Objective decomposition:
High-level objective: max E[sum of r_task(s_t, z_t)]
Low-level objective (pre-training): max E[sum of r_style(s_t, a_t)]
Latent code as communication channel:
z in R^d, where d is the latent dimension (typically 16-64)
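Because the high-level policy only acts every k steps, its training data is usually expressed per decision rather than per simulation step. One common way to do this, assumed here and not confirmed by the source, is to sum the task rewards collected while a latent code is held. The helper `aggregate_high_level_rewards` and its `gamma` parameter are hypothetical names for this sketch.

```python
def aggregate_high_level_rewards(step_rewards, k, gamma=1.0):
    """Collapse per-step task rewards into one reward per high-level decision.

    Each chunk of k consecutive steps (the last chunk may be shorter) is summed,
    optionally discounting within the chunk by gamma.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    high_level_rewards = []
    for start in range(0, len(step_rewards), k):
        chunk = step_rewards[start:start + k]
        high_level_rewards.append(sum(gamma ** i * r for i, r in enumerate(chunk)))
    return high_level_rewards


rewards = [0.0, 0.0, 1.0, 0.0, 0.5, 0.5, 0.0]
print(aggregate_high_level_rewards(rewards, k=3))  # [1.0, 1.0, 0.0]
```

With `gamma=1.0` the total return is unchanged by the aggregation; only the number of decision points over which credit must be assigned shrinks.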
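# Abstract Hierarchical Reinforcement Learning (pseudo-code)
# Helpers such as load_network and concatenate are placeholders, not real library calls.
class LowLevelController:
    """Pre-trained motor skill controller (frozen during HRL training)."""

    def __init__(self, pretrained_weights, latent_dim):
        self.network = load_network(pretrained_weights)
        self.latent_dim = latent_dim
        # Freeze all parameters (assumes a PyTorch-style module); setting
        # requires_grad on the module object itself would not freeze anything.
        for param in self.network.parameters():
            param.requires_grad = False

    def get_action(self, proprioceptive_state, latent_code):
        """Map state + latent code to joint-level action."""
        # Named llc_input to avoid shadowing the built-in input()
        llc_input = concatenate(proprioceptive_state, latent_code)
        joint_actions = self.network.forward(llc_input)
        return joint_actions
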
class HighLevelPolicy:
    """Task-level policy that outputs latent codes for the LLC."""

    def __init__(self, obs_dim, latent_dim, action_interval_k):
        self.network = create_policy_network(obs_dim, latent_dim)
        self.k = action_interval_k  # temporal abstraction factor
        self.current_latent = None
        self.steps_since_update = 0

    def get_latent_code(self, task_observation):
        """Select a new latent code every k steps."""
        if self.steps_since_update % self.k == 0:
            self.current_latent = self.network.forward(task_observation)
            self.steps_since_update = 0
        self.steps_since_update += 1
        return self.current_latent

class HRLAgent:
    """Combines high-level policy with pre-trained low-level controller."""

    def __init__(self, high_level_policy, low_level_controller):
        self.high_level = high_level_policy
        self.low_level = low_level_controller

    def act(self, task_observation, proprioceptive_state):
        # High-level selects abstract goal (latent code)
        latent_code = self.high_level.get_latent_code(task_observation)
        # Low-level translates to motor commands
        joint_actions = self.low_level.get_action(proprioceptive_state, latent_code)
        return joint_actions

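def train_hrl(hrl_agent, environment, num_iterations, episode_length):
    """Train only the high-level policy; LLC remains frozen."""
    for iteration in range(num_iterations):
        # Collect rollouts with temporal abstraction
        trajectories = []
        obs = environment.reset()
        for step in range(episode_length):
            joint_actions = hrl_agent.act(obs.task, obs.proprioceptive)
            # The high-level policy's action is the latent code, not the joint command
            latent_code = hrl_agent.high_level.current_latent
            next_obs, reward, done = environment.step(joint_actions)
            trajectories.append((obs.task, latent_code, reward, done))
            obs = next_obs
            if done:
                # Start the next episode with a freshly selected latent code
                obs = environment.reset()
                hrl_agent.high_level.steps_since_update = 0
        # Update only the high-level policy using PPO; task rewards are stored in the trajectories
        hrl_agent.high_level.update(trajectories)
        # Low-level controller weights remain unchanged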