Principle:Haosulab ManiSkill PPO Agent Architecture
| Field | Value |
|---|---|
| principle_name | Haosulab_ManiSkill_PPO_Agent_Architecture |
| overview | Actor-Critic neural network architecture for Proximal Policy Optimization in continuous action spaces |
| domains | Reinforcement_Learning, Robotics |
| last_updated | 2026-02-15 |
| related_pages | Implementation:Haosulab_ManiSkill_PPO_Agent_Network |
Overview
Description
The PPO Agent Architecture defines the neural network structure used by a Proximal Policy Optimization agent to interact with continuous-action simulation environments. The architecture follows the Actor-Critic paradigm, where two separate networks share no parameters:
- Actor (Policy) Network: Maps observations to a probability distribution over continuous actions. For continuous control, this is a Gaussian (Normal) distribution parameterized by a learned mean vector and a state-independent log-standard-deviation.
- Critic (Value) Network: Maps observations to a scalar estimate of the state value function V(s), used for computing advantages during policy optimization.
Both networks are multi-layer perceptrons (MLPs) with identical hidden layer structure but separate parameters. This separation allows the critic to learn value estimates without interfering with policy gradient updates, and vice versa.
Key architectural choices include:
- Separate actor and critic networks: Unlike shared-backbone architectures, separate networks avoid gradient interference between policy and value objectives. This is particularly important in continuous control where the value function and policy can have very different optimization landscapes.
- Learned state-independent log-standard-deviation: The action distribution standard deviation is parameterized as a learnable `nn.Parameter` rather than being output by the network. This means the exploration noise is the same regardless of the current state, which simplifies optimization and is empirically effective for many continuous control tasks. The parameter is initialized at -0.5, corresponding to a standard deviation of approximately 0.607.
- Orthogonal weight initialization: All linear layers are initialized using orthogonal initialization with a gain of sqrt(2) for hidden layers. The final actor layer uses a much smaller gain (0.01 * sqrt(2)) to produce near-zero initial actions, promoting cautious initial exploration. Biases are initialized to zero.
- Tanh activations: Hidden layers use the hyperbolic tangent activation function, which bounds outputs to [-1, 1]. This is a classical choice for RL policy networks that helps with gradient flow and is compatible with orthogonal initialization.
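The choices above can be combined into a minimal PyTorch sketch. The hidden sizes (256) and gains follow the component table later in this page; `obs_dim` and `act_dim` are placeholders for the environment's observation and action dimensions.

```python
import numpy as np
import torch
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights with the given gain, zero biases
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # Critic: separate MLP ending in a scalar V(s); final gain 1.0
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 1), std=1.0),
        )
        # Actor mean: same hidden structure, separate parameters;
        # small final gain for near-zero initial actions
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, act_dim), std=0.01 * np.sqrt(2)),
        )
        # State-independent log-std, initialized at -0.5 (sigma ~ 0.607)
        self.actor_logstd = nn.Parameter(torch.full((1, act_dim), -0.5))
```

Note that `critic` and `actor_mean` share no modules, so value-function gradients never touch policy parameters.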
Usage
Use this architecture when:
- Training PPO agents on continuous-action ManiSkill environments with state-based observations
- The observation space is a flat vector (e.g., `obs_mode="state"`)
- The action space is a continuous `Box` space (e.g., joint position deltas)
For visual observations (RGBD, pointcloud), the MLP architecture would need to be replaced with CNN or ViT-based encoders, while the overall actor-critic structure remains the same.
Theoretical Basis
Actor-Critic Methods: The actor-critic framework combines policy gradient methods (actor) with value function approximation (critic). The actor produces actions, while the critic evaluates the quality of states. The critic's value estimates are used to compute advantages, which reduce variance in policy gradient estimates compared to Monte Carlo returns.
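As an illustration of how the critic's estimates enter the policy gradient, the one-step advantage is the temporal-difference error; the numbers below are placeholders, and practical PPO implementations typically aggregate these one-step terms (e.g., via generalized advantage estimation).

```python
import torch

# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed from critic outputs
rewards = torch.tensor([1.0, 0.5])      # placeholder rewards
values = torch.tensor([0.9, 0.8])       # critic estimates V(s_t)
next_values = torch.tensor([0.8, 0.0])  # critic estimates V(s_{t+1})
gamma = 0.99
advantages = rewards + gamma * next_values - values
```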
Gaussian Policies for Continuous Control: In continuous action spaces, the policy is represented as a multivariate Gaussian distribution. Given an observation x:
- The actor network outputs the mean vector: `mu = actor_mean(x)`
- The log-standard-deviation is a learned parameter: `log_sigma`
- The standard deviation is: `sigma = exp(log_sigma)`
- Actions are sampled as: `a ~ N(mu, diag(sigma^2))`
The diagonal covariance assumption (independent action dimensions) simplifies computation while being sufficient for most robotics tasks.
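The sampling procedure above can be sketched with `torch.distributions.Normal`, which models the diagonal covariance as one independent Normal per action dimension; the mean values here are placeholders.

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(4)                  # actor_mean(x), placeholder output
log_sigma = torch.full((4,), -0.5)   # learned log-std parameter
sigma = log_sigma.exp()              # sigma = exp(log_sigma) ~ 0.607
dist = Normal(mu, sigma)             # diagonal Gaussian over actions
action = dist.sample()               # a ~ N(mu, diag(sigma^2))
```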
Log-Probability and Entropy: For the PPO loss computation:
- Log-probability: `log pi(a|s) = sum_i log N(a_i; mu_i, sigma_i)` (summed across action dimensions)
- Entropy: `H(pi(.|s)) = sum_i 0.5 * log(2*pi*e*sigma_i^2)` (used as an exploration bonus)
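Both quantities can be computed from the per-dimension Normals and summed across the action dimension; the sketch below also checks the library entropy against the closed form above.

```python
import math
import torch
from torch.distributions import Normal

mu = torch.zeros(4)                        # placeholder action mean
sigma = torch.full((4,), -0.5).exp()       # exp(log_sigma)
dist = Normal(mu, sigma)
a = dist.sample()

# Sum the independent per-dimension terms over the action dimension
log_prob = dist.log_prob(a).sum(-1)        # log pi(a|s)
entropy = dist.entropy().sum(-1)           # H(pi(.|s))

# Closed form: sum_i 0.5 * log(2*pi*e*sigma_i^2)
closed = (0.5 * torch.log(2 * math.pi * math.e * sigma**2)).sum(-1)
```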
Orthogonal Initialization (Saxe et al., 2014): Initializing weight matrices as orthogonal matrices preserves the norm of activations through the network, preventing vanishing or exploding gradients at the start of training. The gain factor controls the scale:
- Hidden layers: `gain = sqrt(2)` (standard for Tanh activations)
- Output layer (actor): `gain = 0.01 * sqrt(2)` (small initial actions for cautious exploration)
- Output layer (critic): `gain = 1.0` (default for value outputs)
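These gains map directly onto `nn.init.orthogonal_`; the layer sizes here are placeholders (`act_dim = 8`), and the final line verifies the norm-preservation property: for a square orthogonal matrix scaled by gain g, the Gram matrix W @ W.T is g^2 times the identity.

```python
import math
import torch
import torch.nn as nn

hidden = nn.Linear(256, 256)
nn.init.orthogonal_(hidden.weight, gain=math.sqrt(2))   # hidden layers
nn.init.constant_(hidden.bias, 0.0)                     # zero biases

actor_out = nn.Linear(256, 8)                           # act_dim = 8, placeholder
nn.init.orthogonal_(actor_out.weight, gain=0.01 * math.sqrt(2))

critic_out = nn.Linear(256, 1)
nn.init.orthogonal_(critic_out.weight, gain=1.0)

# Norm preservation: W @ W.T should be close to gain^2 * I = 2 * I
gram = hidden.weight.detach() @ hidden.weight.detach().T
```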
| Component | Architecture | Input | Output | Initialization |
|---|---|---|---|---|
| Critic | 4-layer MLP (in->256->256->256->1) with Tanh | obs vector (obs_dim,) | scalar value V(s) | Orthogonal, gain=sqrt(2); final gain=1.0 |
| Actor Mean | 4-layer MLP (in->256->256->256->act_dim) with Tanh | obs vector (obs_dim,) | action mean vector (act_dim,) | Orthogonal, gain=sqrt(2); final gain=0.01*sqrt(2) |
| Actor Log-Std | nn.Parameter | N/A (learned constant) | log-std vector (1, act_dim) | Initialized at -0.5 |
Related Pages
- Implementation:Haosulab_ManiSkill_PPO_Agent_Network -- Concrete PyTorch implementation of this architecture
- Principle:Haosulab_ManiSkill_PPO_Policy_Optimization -- How this architecture is trained via the PPO algorithm
- Principle:Haosulab_ManiSkill_GPU_Parallelized_Rollout -- How the agent collects experience from parallel environments