Principle:Isaac sim IsaacGymEnvs Asymmetric Actor Critic

**Metadata**
Knowledge Sources	Asymmetric Actor Critic, DeXtreme, IsaacGymEnvs
Domains	Reinforcement_Learning, Sim_to_Real
Last Updated	2026-02-15 00:00 GMT

Overview

Asymmetric actor-critic is a training methodology in which the critic network receives privileged simulation information that is not available to the actor. This improves value estimation while keeping the actor's observations deployable on real hardware. It is a core technique for sim-to-real policy training in the DeXtreme pipeline.

Description

In asymmetric actor-critic, the actor and critic operate on different observation spaces:

Actor observations (the policy input) contain only information available on real hardware:

Joint positions (DOF positions, 16 dims)
Fingertip poses (position + quaternion + linear/angular velocity, 13 per fingertip, 4 fingertips)
Object pose estimate (potentially noisy and delayed, mimicking camera-based estimation)
Goal pose
Previous actions
Rotation distance to goal

Critic observations (the value function input) include all actor observations plus privileged information:

True object velocities (not available from cameras on real hardware)
Force/torque sensor readings at full resolution
Randomized parameter values (cube random params, hand random params)
Gravity vector (which may be randomized)
Stochastic delay parameters (observation delay probability, action latency, pose refresh rate)
Affine noise parameters (the actual noise coefficients applied to observations and actions)
Random body forces applied to the object

This asymmetry is implemented via dictionary observations with distinct keys: the actor receives a subset of observation channels, while the critic receives all channels. The num_obs_dict in AllegroHandDextreme defines the dimensionality of each channel.
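The channel split can be illustrated with a minimal sketch. The channel names and all sizes other than the 16 joint positions and the 13 x 4 fingertip values are hypothetical placeholders, not the actual entries of num_obs_dict, and only a few of the channels listed above are included.

```python
# Hypothetical channel names and sizes, for illustration only.
# The real channels and dimensions are defined by num_obs_dict in the task.
NUM_OBS_DICT = {
    "dof_pos": 16,              # joint positions (16 dims)
    "fingertip_state": 13 * 4,  # 13 values per fingertip, 4 fingertips
    "object_pose_cam": 7,       # assumed: noisy position + quaternion estimate
    "goal_pose": 7,             # assumed: position + quaternion
    "last_actions": 16,         # assumed: one action per joint
    "object_vel_true": 6,       # privileged: true linear + angular velocity
    "gravity_vec": 3,           # privileged: possibly randomized gravity
}

# The actor reads a subset of the channels; the critic reads all of them.
ACTOR_KEYS = ("dof_pos", "fingertip_state", "object_pose_cam",
              "goal_pose", "last_actions")
CRITIC_KEYS = tuple(NUM_OBS_DICT)  # dicts preserve insertion order


def flatten(obs, keys):
    """Concatenate the selected channels into one flat input vector."""
    out = []
    for key in keys:
        channel = obs[key]
        assert len(channel) == NUM_OBS_DICT[key], f"bad size for {key}"
        out.extend(channel)
    return out


def split_observations(obs):
    """Return (actor_input, critic_input) built from one observation dict."""
    return flatten(obs, ACTOR_KEYS), flatten(obs, CRITIC_KEYS)


# Dummy all-zero observation, just to show the resulting input sizes.
obs = {key: [0.0] * size for key, size in NUM_OBS_DICT.items()}
actor_in, critic_in = split_observations(obs)
print(len(actor_in), len(critic_in))  # 98 107
```

Keeping the privileged channels under their own keys means the deployed policy never has to be retrained or reshaped: on hardware, only the ACTOR_KEYS channels need to be populated.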

Usage

Enable asymmetric observations by setting env.asymmetric_observations: True in the task YAML config. The training algorithm (rl_games) must be configured to use the dict observation space, passing the full dictionary to the critic and only the actor-relevant subset to the policy.

Theoretical Basis

The theoretical foundation is the asymmetric information formulation of actor-critic methods:

Policy:         pi(a | o_actor)     -- maps limited observations to actions
Value function: V(o_critic)         -- maps full state to value estimate

where o_actor is a strict subset of o_critic

Training:
    1. Collect trajectories using pi(a | o_actor)
    2. Compute advantages using V(o_critic) -- better value estimates
    3. Update pi via policy gradient with the improved advantages
    4. Update V to minimize value prediction error on o_critic

Deployment:
    Only pi(a | o_actor) is needed -- critic is discarded
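
The four training steps can be sketched on a toy problem. This is an illustrative assumption, not the DeXtreme or rl_games implementation: the environment is a one-step task, the actor sees nothing (its observation is empty), the critic sees a hidden binary variable h, and the reward function is invented for the example.

```python
import math
import random


def train(steps=2000, lr_pi=0.1, lr_v=0.1, seed=0):
    """Toy asymmetric actor-critic on a one-step task (hypothetical setup)."""
    rng = random.Random(seed)
    theta = 0.0            # actor parameter: logit of choosing action 1
    V = {0: 0.0, 1: 0.0}   # critic table, indexed by the privileged variable h
    for _ in range(steps):
        h = rng.randint(0, 1)               # privileged state, hidden from the actor
        p = 1.0 / (1.0 + math.exp(-theta))  # pi(a=1 | o_actor), o_actor is empty
        a = 1 if rng.random() < p else 0    # 1. act using the actor only
        r = a + 5.0 * h + rng.gauss(0.0, 0.1)
        adv = r - V[h]                      # 2. advantage from V(o_critic)
        theta += lr_pi * (a - p) * adv      # 3. policy gradient; d/dtheta log pi = a - p
        V[h] += lr_v * (r - V[h])           # 4. regress V toward the observed return
    return theta, V


theta, V = train()
print(theta > 0.0, V[1] > V[0])
```

Action 1 always earns one extra unit of reward, so the logit theta should grow, while the critic separately learns that h = 1 episodes are worth about 5 more. At deployment only theta is needed; the table V is discarded.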

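The key insight is that the policy gradient stays unbiased as long as the baseline subtracted from the return does not depend on the sampled action; it does not matter which observations that baseline is computed from. Giving the critic privileged information therefore does not bias the actor's update. It only makes the advantage estimates more accurate, which leads to: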

Faster training: The critic can more accurately distinguish good states from bad states, reducing variance in the policy gradient.
Better credit assignment: The critic can attribute rewards to the correct factors (e.g., knowing the true object velocity helps distinguish skill from luck).
No deployment cost: The actor policy is trained to operate purely on real-hardware-compatible observations.
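
The reason a privileged baseline does not bias the actor update can be written out in one line. The notation here is an assumption added for illustration: s is the full simulator state, o_actor and o_critic are both functions of s, and b is any baseline such as V(o_critic).

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid o_{\text{actor}})}
  \big[ \nabla_\theta \log \pi_\theta(a \mid o_{\text{actor}}) \, b(o_{\text{critic}}) \big]
= b(o_{\text{critic}}) \sum_a \nabla_\theta \pi_\theta(a \mid o_{\text{actor}})
= b(o_{\text{critic}}) \, \nabla_\theta \sum_a \pi_\theta(a \mid o_{\text{actor}})
= b(o_{\text{critic}}) \, \nabla_\theta 1
= 0
```

For a fixed state, b(o_critic) is a constant with respect to the action, so subtracting it changes the variance of the gradient estimate but not its expectation. For continuous actions the sum becomes an integral and the argument is unchanged.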

For DeXtreme specifically, the critic receives the actual randomization parameter values as input. This allows the value function to condition on the difficulty of the current environment configuration, providing better baseline estimates for each environment.
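
The benefit of conditioning the baseline on environment difficulty can be checked numerically. The sketch below is a toy illustration under invented assumptions: two difficulty levels with made-up mean returns and unit Gaussian noise, compared under a single shared baseline versus a per-difficulty baseline.

```python
import random
from statistics import mean, pvariance


def advantage_variances(n=4000, seed=0):
    """Compare advantage variance for blind vs. difficulty-aware baselines."""
    rng = random.Random(seed)
    mean_return = {"easy": 10.0, "hard": 2.0}  # hypothetical values
    samples = []
    for _ in range(n):
        d = rng.choice(("easy", "hard"))       # randomized environment difficulty
        samples.append((d, mean_return[d] + rng.gauss(0.0, 1.0)))

    # Blind baseline: one average over all environments.
    blind_baseline = mean(g for _, g in samples)
    # Privileged baseline: one average per difficulty level.
    priv_baseline = {k: mean(g for d, g in samples if d == k)
                     for k in mean_return}

    blind_adv = [g - blind_baseline for _, g in samples]
    priv_adv = [g - priv_baseline[d] for d, g in samples]
    return pvariance(blind_adv), pvariance(priv_adv)


blind_var, priv_var = advantage_variances()
print(priv_var < blind_var)
```

The blind baseline leaves the gap between easy and hard environments inside the advantages, so they mostly reflect which environment was drawn. The difficulty-aware baseline removes that gap and leaves only the within-environment noise.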

Related Pages

Implementation:Isaac_sim_IsaacGymEnvs_AllegroHandDextreme_Observations
