
Principle:Isaac sim IsaacGymEnvs Adversarial Motion Prior Training

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, Motion_Imitation
Last Updated 2026-02-15 11:00 GMT

Overview

Adversarial Motion Priors (AMP) combine a task-specific reward with a style reward derived from a discriminator trained on reference motion data, enabling agents to learn natural movement patterns while completing assigned tasks.

Description

Adversarial Motion Priors draw inspiration from Generative Adversarial Networks (GANs) by introducing a discriminator network that learns to distinguish between transitions produced by the agent's policy and transitions extracted from a reference motion dataset. The discriminator provides a style reward signal that encourages the agent to produce behaviors that are statistically indistinguishable from the reference motions. This approach eliminates the need for hand-crafted reward functions to specify movement style, replacing them with data-driven style objectives.
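
In practice the discriminator is commonly a small MLP that scores a concatenated state transition (s, s'). The sketch below illustrates this in PyTorch; the class name AMPDiscriminator, the hidden sizes, and the helper discriminator_loss are illustrative assumptions rather than the IsaacGymEnvs implementation, but the loss matches the binary cross-entropy objective described above.

# Minimal discriminator sketch (PyTorch; names and sizes are illustrative assumptions)
import torch
import torch.nn as nn

class AMPDiscriminator(nn.Module):
    """Scores a transition (s, s'); a higher logit means 'more reference-like'."""

    def __init__(self, obs_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden_dim),  # input is concat(s, s')
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),            # single real-valued logit
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def discriminator_loss(disc, ref_s, ref_s_next, agent_s, agent_s_next):
    """Reference transitions are labeled 1, agent transitions 0 (GAN objective)."""
    bce = nn.functional.binary_cross_entropy_with_logits
    ref_logits = disc(ref_s, ref_s_next)
    agent_logits = disc(agent_s, agent_s_next)
    return (bce(ref_logits, torch.ones_like(ref_logits))
            + bce(agent_logits, torch.zeros_like(agent_logits)))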

The training procedure alternates between two phases. In the first phase, the discriminator is updated to better classify agent-generated transitions versus reference transitions. In the second phase, the reinforcement learning policy is updated using a combined reward signal that blends the task reward (e.g., reaching a target, locomotion velocity) with the style reward from the discriminator. The style weight parameter controls the trade-off between task completion and motion naturalness.

A replay buffer plays a critical role in stabilizing training. Agent transitions are stored in the replay buffer and sampled alongside reference motion data to train the discriminator. This prevents the discriminator from overfitting to the most recent policy behavior and provides a more diverse training signal. The overall objective for the agent can be expressed as: reward = task_reward + style_weight * disc_reward, where disc_reward is derived from the discriminator's confidence that a transition came from the reference dataset.
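
A minimal replay buffer for this purpose can be sketched as follows. The class name TransitionReplayBuffer and the fixed-capacity FIFO design are illustrative assumptions; a production buffer in IsaacGymEnvs would hold batched GPU tensors rather than Python lists.

# Minimal replay buffer sketch (name and capacity are illustrative assumptions)
import random

class TransitionReplayBuffer:
    """FIFO store of (s, s') agent transitions for discriminator training."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []

    def store(self, transitions):
        self.buffer.extend(transitions)  # transitions: iterable of (s, s') pairs
        if len(self.buffer) > self.capacity:
            # FIFO eviction: drop the oldest transitions beyond capacity
            self.buffer = self.buffer[-self.capacity:]

    def sample(self, batch_size):
        # Uniform sampling mixes recent and older policy behavior, which keeps
        # the discriminator from overfitting to the latest policy alone
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))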

Usage

Use Adversarial Motion Priors when you need an agent to perform a task while exhibiting natural, human-like or animal-like motion. This is particularly valuable in humanoid locomotion, character animation, and any domain where the quality of motion matters alongside task completion. AMP is preferred over direct motion tracking when the agent must adapt its movement to accomplish variable goals rather than reproducing a fixed motion clip exactly.

Theoretical Basis

The core equations governing AMP training are:

Discriminator objective:

L_disc = -E_ref[log(D(s, s'))] - E_agent[log(1 - D(s, s'))]

Style reward:

r_style = -log(1 - D(s, s'))

Combined reward:

r_total = r_task + w_style * r_style

where D(s, s') is the discriminator output for a state transition, E_ref denotes expectation over reference motion transitions, and E_agent denotes expectation over agent-generated transitions.
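
To make these formulas concrete, here is a minimal numeric sketch (the eps guard and the style_weight value are illustrative assumptions). A transition the discriminator believes is reference-like, D(s, s') = 0.9, earns r_style = -log(0.1) ≈ 2.30, while an easily detected agent transition with D(s, s') = 0.1 earns only -log(0.9) ≈ 0.11.

# Numeric sketch of the reward equations (eps and style_weight are assumptions)
import math

def style_reward(d, eps=1e-8):
    """r_style = -log(1 - D(s, s')); eps guards against log(0) as D -> 1."""
    return -math.log(max(1.0 - d, eps))

def total_reward(r_task, d, style_weight=0.5):
    """r_total = r_task + w_style * r_style."""
    return r_task + style_weight * style_reward(d)

print(style_reward(0.9))       # ~2.303: reference-like transition, high reward
print(style_reward(0.1))       # ~0.105: easily detected agent transition
print(total_reward(1.0, 0.9))  # ~2.151 with style_weight = 0.5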

# Abstract AMP Training Algorithm (pseudo-code)

def amp_training_loop(policy, discriminator, replay_buffer, motion_lib,
                      environment, num_iterations, batch_size, style_weight):
    for iteration in range(num_iterations):
        # Step 1: Collect agent transitions using current policy
        agent_transitions = collect_rollouts(policy, environment)
        replay_buffer.store(agent_transitions)

        # Step 2: Sample reference transitions from motion library
        reference_transitions = motion_lib.sample_transitions(batch_size)

        # Step 3: Sample historical agent transitions from replay buffer
        agent_samples = replay_buffer.sample(batch_size)

        # Step 4: Update discriminator
        disc_loss = compute_discriminator_loss(
            discriminator, agent_samples, reference_transitions
        )
        discriminator.update(disc_loss)

        # Step 5: Compute combined reward
        task_reward = compute_task_reward(agent_transitions)
        style_reward = compute_style_reward(discriminator, agent_transitions)
        total_reward = task_reward + style_weight * style_reward

        # Step 6: Update policy using PPO with combined reward
        policy.update(agent_transitions, total_reward)
