Principle:Isaac sim IsaacGymEnvs Hierarchical Reinforcement Learning

From Leeroopedia
Knowledge Sources
Domains: Hierarchical_Reinforcement_Learning, Motion_Control
Last Updated: 2026-02-15 11:00 GMT

Overview

Hierarchical Reinforcement Learning (HRL) decomposes control into multiple levels: a high-level controller selects latent goals or actions, and a pre-trained low-level controller (LLC) translates them into joint-level motor commands. This enables temporal abstraction and reuse of learned motor skills.

Description

Standard (flat) reinforcement learning requires a single policy to map observations directly to low-level actions at every simulation timestep. For complex motor tasks -- such as humanoid locomotion, object manipulation, or multi-stage behaviors -- this flat approach faces challenges including long time horizons, sparse rewards, and an enormous action space. Hierarchical Reinforcement Learning addresses these challenges by decomposing the control problem into at least two levels: a high-level policy that reasons about abstract goals, and a low-level controller (LLC) that executes concrete motor commands to achieve those goals.

The low-level controller is typically pre-trained on a repertoire of motor skills, often using techniques such as Adversarial Motion Priors (AMP) or direct motion tracking. Once trained, the LLC is frozen: its weights are no longer updated. The LLC takes as input the current proprioceptive state of the agent along with a latent code (a compact vector in a learned latent space) that specifies what behavior to produce, and outputs joint torques or position targets that realize that behavior. The pre-training phase is what allows the LLC to produce a diverse set of natural, physically plausible movements.

The high-level policy operates at a coarser temporal resolution: it selects a new latent code every k simulation steps rather than every step. This temporal abstraction cuts the number of decisions the high-level policy makes per episode by a factor of k, which shortens its effective horizon and simplifies credit assignment. The high-level policy receives task-relevant observations (e.g., target location, object state, terrain information) and outputs a latent code that is held constant for k steps while the LLC executes it. It is trained with reinforcement learning on the task reward and never needs to reason about individual joint movements. This separation of concerns enables skill reuse: the same pre-trained LLC can be paired with different high-level policies for different tasks, and each high-level policy can focus on what to do rather than how to do it.

Usage

Use Hierarchical Reinforcement Learning when the task requires complex, multi-step behavior that combines locomotion with task-specific objectives. HRL is particularly effective when you have access to a pre-trained low-level motor skill controller (e.g., from AMP training) and want to leverage it for a new task without retraining from scratch. It is also valuable when the action space is high-dimensional (many DOFs) and the task reward is sparse, as the temporal abstraction provided by the hierarchy helps bridge the gap between actions and outcomes.

Theoretical Basis

Temporal abstraction:

High-level policy acts every k steps: z_t = pi_high(o_t) for t = 0, k, 2k, ...; at all other steps the latest code is held, i.e. z_t = z_{k * floor(t / k)}

Low-level controller acts every step: a_t = pi_low(s_t, z_t) for all t
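The two update rates above can be illustrated with a small, self-contained rollout. This is only a sketch: `rollout_schedule` and the `toy_*` functions are hypothetical stand-ins invented for illustration, not part of IsaacGymEnvs, and the "policies" are trivial arithmetic so the schedule is easy to inspect.

```python
def rollout_schedule(num_steps, k, pi_high, pi_low, observe):
    """Run a toy rollout and return (actions, latents, query_steps).

    The high-level policy is queried only at t = 0, k, 2k, ...;
    the low-level policy is queried at every step with the held latent.
    """
    actions, latents, query_steps = [], [], []
    z = None
    for t in range(num_steps):
        o_t, s_t = observe(t)
        if t % k == 0:
            # New high-level decision: z_t = pi_high(o_t)
            z = pi_high(o_t)
            query_steps.append(t)
        latents.append(z)
        # Low-level action at every step: a_t = pi_low(s_t, z_t)
        actions.append(pi_low(s_t, z))
    return actions, latents, query_steps


# Hypothetical stand-ins: the observation and state are just the time index.
def toy_observe(t):
    return float(t), float(t)


def toy_pi_high(o):
    return o  # the "latent" records when the decision was made


def toy_pi_low(s, z):
    return s + z


actions, latents, query_steps = rollout_schedule(
    num_steps=7, k=3, pi_high=toy_pi_high, pi_low=toy_pi_low, observe=toy_observe
)
print(query_steps)  # [0, 3, 6]
print(latents)      # [0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 6.0]
```

With 7 steps and k = 3, the high-level policy is queried 3 times while the low-level policy is queried 7 times. In general the high-level policy makes ceil(T / k) decisions over T steps.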

Objective decomposition:

High-level objective: max E[sum of r_task(s_t, z_t)]

Low-level objective (pre-training): max E[sum of r_style(s_t, a_t)]
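Because the high-level policy acts once per k steps, the per-step task rewards collected while a latent code is held have to be attributed to that single high-level decision. The text does not specify how this is done. The sketch below assumes one common choice: summing the rewards within each interval, with optional discounting inside it. The function name `aggregate_rewards` and the `gamma` parameter are hypothetical.

```python
def aggregate_rewards(step_rewards, k, gamma=1.0):
    """Collapse per-step rewards into one reward per k-step interval.

    Assumption: the reward credited to a high-level decision is the
    (optionally discounted) sum of task rewards earned while its latent
    code was held. The final interval may be shorter than k steps.
    """
    high_level_rewards = []
    for start in range(0, len(step_rewards), k):
        chunk = step_rewards[start:start + k]
        high_level_rewards.append(
            sum((gamma ** i) * r for i, r in enumerate(chunk))
        )
    return high_level_rewards


rewards = [1.0, 0.0, 2.0, 0.5, 0.5, 1.0, 3.0]
print(aggregate_rewards(rewards, k=3))             # [3.0, 2.0, 3.0]
print(aggregate_rewards(rewards, k=3, gamma=0.5))  # [1.5, 1.0, 3.0]
```

The high-level learner then sees one (observation, latent code, aggregated reward) transition per interval instead of k separate per-step transitions.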

Latent code as communication channel:

z in R^d, where d is the latent dimension (typically 16-64)
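The text does not say whether z is constrained beyond its dimension. Some latent-skill methods project the high-level output onto the unit hypersphere so that the LLC only receives codes of the kind it saw during pre-training. The sketch below illustrates that convention as an assumption, not as a description of this codebase. `normalize_latent` is a hypothetical helper.

```python
import math


def normalize_latent(raw_latent, eps=1e-8):
    """Project a raw latent vector onto the unit hypersphere (assumed convention)."""
    norm = math.sqrt(sum(x * x for x in raw_latent))
    if norm < eps:
        raise ValueError("cannot normalize a near-zero latent vector")
    return [x / norm for x in raw_latent]


z = normalize_latent([3.0, 4.0])
print(z)  # [0.6, 0.8]

# The LLC input is the proprioceptive state concatenated with z,
# so its size is state_dim + d (here 3 + 2 = 5).
proprioceptive_state = [0.1, -0.2, 0.05]
llc_input = proprioceptive_state + z
print(len(llc_input))  # 5
```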

# Abstract Hierarchical Reinforcement Learning (pseudo-code)

class LowLevelController:
    """Pre-trained motor skill controller (frozen during HRL training)."""
    def __init__(self, pretrained_weights, latent_dim):
        self.network = load_network(pretrained_weights)
        self.latent_dim = latent_dim
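        # Freeze all parameters. Assuming a PyTorch-style module: the flag must be
        # set on each parameter, since assigning it on the module has no effect.
        for param in self.network.parameters():
            param.requires_grad = False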

    def get_action(self, proprioceptive_state, latent_code):
        """Map state + latent code to joint-level action."""
        # Named llc_input to avoid shadowing the built-in `input`
        llc_input = concatenate(proprioceptive_state, latent_code)
        joint_actions = self.network.forward(llc_input)
        return joint_actions

class HighLevelPolicy:
    """Task-level policy that outputs latent codes for the LLC."""
    def __init__(self, obs_dim, latent_dim, action_interval_k):
        self.network = create_policy_network(obs_dim, latent_dim)
        self.k = action_interval_k    # temporal abstraction factor
        self.current_latent = None
        self.steps_since_update = 0

    def get_latent_code(self, task_observation):
        """Select a new latent code every k steps."""
        if self.steps_since_update % self.k == 0:
            self.current_latent = self.network.forward(task_observation)
            self.steps_since_update = 0
        self.steps_since_update += 1
        return self.current_latent

class HRLAgent:
    """Combines high-level policy with pre-trained low-level controller."""
    def __init__(self, high_level_policy, low_level_controller):
        self.high_level = high_level_policy
        self.low_level = low_level_controller

    def act(self, task_observation, proprioceptive_state):
        # High-level selects abstract goal (latent code)
        latent_code = self.high_level.get_latent_code(task_observation)
        # Low-level translates to motor commands
        joint_actions = self.low_level.get_action(proprioceptive_state, latent_code)
        return joint_actions

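def train_hrl(hrl_agent, environment, num_iterations, episode_length):
    """Train only the high-level policy; LLC remains frozen."""
    for iteration in range(num_iterations):
        # Collect rollouts with temporal abstraction
        trajectories = []
        obs = environment.reset()
        for step in range(episode_length):
            action = hrl_agent.act(obs.task, obs.proprioceptive)
            # The high-level policy's "action" is the latent code, so store it
            # (not the joint-level action) together with the task observation.
            latent_code = hrl_agent.high_level.current_latent
            next_obs, reward, done = environment.step(action)
            trajectories.append((obs.task, latent_code, reward, done))
            # Start a new episode when the current one terminates
            obs = environment.reset() if done else next_obs

        # Update only the high-level policy (e.g., with PPO) on the task reward
        hrl_agent.high_level.update(trajectories)
        # Low-level controller weights remain unchanged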
