Principle:Isaac sim IsaacGymEnvs Asymmetric Actor Critic
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A training methodology in which the critic network receives privileged simulation information that is not available to the actor. This enables better value estimation while keeping the actor's observations restricted to what can be measured on real hardware. It is a core technique for sim-to-real policy training in the DeXtreme pipeline.
Description
In asymmetric actor-critic, the actor and critic operate on different observation spaces:
Actor observations (the policy input) contain only information available on real hardware:
- Joint positions (DOF positions, 16 dims)
- Fingertip states (position + quaternion + linear/angular velocity, 13 dims per fingertip, 4 fingertips)
- Object pose estimate (potentially noisy and delayed, mimicking camera-based estimation)
- Goal pose
- Previous actions
- Rotation distance to goal
Critic observations (the value function input) include all actor observations plus privileged information:
- True object velocities (not available from cameras on real hardware)
- Force/torque sensor readings at full resolution
- Randomized parameter values (cube random params, hand random params)
- Gravity vector (which may be randomized)
- Stochastic delay parameters (observation delay probability, action latency, pose refresh rate)
- Affine noise parameters (the actual noise coefficients applied to observations and actions)
- Random body forces applied to the object
This asymmetry is implemented via dictionary observations with distinct keys: the actor receives a subset of observation channels, while the critic receives all channels. The num_obs_dict in AllegroHandDextreme defines the dimensionality of each channel.
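The channel split described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the channel names, the per-channel sizes other than the 16 joint positions and 13 x 4 fingertip values, and the helper `flatten` are all hypothetical, and the real keys and dimensions are whatever num_obs_dict in AllegroHandDextreme defines.

```python
import numpy as np

# Hypothetical channel names and sizes, for illustration only. The real
# keys and dimensions come from num_obs_dict in AllegroHandDextreme.
NUM_OBS_DICT = {
    "dof_pos": 16,          # joint positions (16 dims)
    "fingertip_state": 52,  # 13 dims per fingertip x 4 fingertips
    "object_pose_cam": 7,   # noisy/delayed pose estimate (assumed pos + quat)
    "goal_pose": 7,         # assumed pos + quat
    "last_actions": 16,     # assumed one action per DOF
    "object_vel_true": 6,   # privileged: assumed linear + angular velocity
    "gravity_vec": 3,       # privileged: possibly randomized gravity
}

# The actor sees only hardware-compatible channels; the critic sees all of them.
ACTOR_KEYS = ["dof_pos", "fingertip_state", "object_pose_cam",
              "goal_pose", "last_actions"]
CRITIC_KEYS = list(NUM_OBS_DICT)


def flatten(obs, keys):
    """Concatenate the selected channels along the feature axis."""
    return np.concatenate([obs[k] for k in keys], axis=-1)


num_envs = 4
# Placeholder observations: one zero-filled array per channel.
obs = {k: np.zeros((num_envs, d), dtype=np.float32)
       for k, d in NUM_OBS_DICT.items()}

actor_in = flatten(obs, ACTOR_KEYS)    # policy input
critic_in = flatten(obs, CRITIC_KEYS)  # value-function input
print(actor_in.shape, critic_in.shape)
```

Keeping the split as two key lists over one dictionary means the actor input is a subset of the critic input by construction, which matches the "subset of observation channels" description.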
Usage
Enable asymmetric observations by setting env.asymmetric_observations: True in the task YAML config. The training algorithm (rl_games) must be configured to use the dict observation space, passing the full dictionary to the critic and only the actor-relevant subset to the policy.
Theoretical Basis
The theoretical foundation is the asymmetric information formulation of actor-critic methods:
Policy: pi(a | o_actor) -- maps limited observations to actions
Value function: V(o_critic) -- maps full state to value estimate
where o_actor is a strict subset of o_critic
Training:
1. Collect trajectories using pi(a | o_actor)
2. Compute advantages using V(o_critic) -- better value estimates
3. Update pi via policy gradient with the improved advantages
4. Update V to minimize value prediction error on o_critic
Deployment:
Only pi(a | o_actor) is needed -- critic is discarded
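The four training steps can be sketched with linear function approximators in NumPy. This is a minimal sketch under stated assumptions, not the rl_games implementation: the dimensions, the random stand-in trajectory, the unit-variance Gaussian policy, the use of GAE for step 2, and the plain gradient steps are all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTOR_DIM, PRIV_DIM, ACT_DIM, T = 8, 4, 2, 32  # illustrative sizes
CRITIC_DIM = ACTOR_DIM + PRIV_DIM

W_pi = rng.normal(scale=0.1, size=(ACTOR_DIM, ACT_DIM))  # actor weights
w_v = np.zeros(CRITIC_DIM)                               # critic weights


def policy_mean(o_actor):
    """Mean of a unit-variance Gaussian policy pi(a | o_actor)."""
    return o_actor @ W_pi


def value(o_critic):
    """Linear value estimate V(o_critic)."""
    return o_critic @ w_v


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory."""
    adv = np.zeros_like(rewards)
    running, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv


# 1. Collect a trajectory with pi(a | o_actor). Random data stands in for
#    the simulator; o_actor is the first ACTOR_DIM entries of o_critic.
o_critic = rng.normal(size=(T + 1, CRITIC_DIM))
o_actor = o_critic[:, :ACTOR_DIM]
actions = policy_mean(o_actor[:T]) + rng.normal(size=(T, ACT_DIM))
rewards = rng.normal(size=T)

# 2. Compute advantages from V(o_critic), which sees the privileged entries.
values = value(o_critic)
adv = gae(rewards, values[:T], values[T])
returns = adv + values[:T]

# 3. Policy-gradient ascent step: grad log pi = o_actor^T (a - mean).
grad_W = o_actor[:T].T @ ((actions - policy_mean(o_actor[:T])) * adv[:, None]) / T
W_pi += 1e-2 * grad_W

# 4. Critic regression step toward the returns, using o_critic.
err = value(o_critic[:T]) - returns
w_v -= 1e-2 * (o_critic[:T].T @ err) / T
```

At deployment only `W_pi` and `policy_mean` would be kept; `w_v` and the privileged entries of `o_critic` are never touched by the policy.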
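The key insight is that the critic is a training-time tool only. The policy gradient needs advantage estimates, but it does not require them to be computed from the same observations the policy sees: a baseline that does not depend on the chosen action leaves the expected gradient unchanged, so the critic is free to condition on privileged information. Giving the critic that information yields better advantage estimates, which leads to: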
- Faster training: The critic can more accurately distinguish good states from bad states, reducing variance in the policy gradient.
- Better credit assignment: The critic can attribute rewards to the correct factors (e.g., knowing the true object velocity helps distinguish skill from luck).
- No deployment cost: The actor policy is trained to operate purely on real-hardware-compatible observations.
For DeXtreme specifically, the critic receives the actual randomization parameter values as input. This allows the value function to condition on the difficulty of the current environment configuration, providing better baseline estimates for each environment.
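A toy numerical example can illustrate why conditioning the baseline on randomization parameters helps. It is a sketch under invented assumptions, not a DeXtreme measurement: the scalar `difficulty`, the linear relation between difficulty and return, and the noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical setup: each environment draws a scalar "difficulty" from its
# randomization, and the episode return drops as difficulty increases.
difficulty = rng.uniform(0.0, 1.0, size=n)
returns = 1.0 - 2.0 * difficulty + rng.normal(scale=0.1, size=n)

# Blind baseline: cannot see difficulty, so it predicts the mean return.
blind_baseline = np.full(n, returns.mean())

# Privileged baseline: least-squares fit of the return on difficulty.
A = np.stack([np.ones(n), difficulty], axis=1)
coef, *_ = np.linalg.lstsq(A, returns, rcond=None)
priv_baseline = A @ coef

# Variance of the advantage-like residual under each baseline.
var_blind = np.var(returns - blind_baseline)
var_priv = np.var(returns - priv_baseline)
print(f"blind: {var_blind:.4f}  privileged: {var_priv:.4f}")
```

In this toy setting the privileged residual variance should be much smaller than the blind one, because the blind baseline has to absorb the spread in returns caused by difficulty alone.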