Principle:Haosulab ManiSkill PPO Agent Architecture
| Field | Value |
|---|---|
| principle_name | Haosulab_ManiSkill_PPO_Agent_Architecture |
| overview | Actor-Critic neural network architecture for Proximal Policy Optimization in continuous action spaces |
| domains | Reinforcement_Learning, Robotics |
| last_updated | 2026-02-15 |
| related_pages | Implementation:Haosulab_ManiSkill_PPO_Agent_Network |
Overview
Description
The PPO Agent Architecture defines the neural network structure used by a Proximal Policy Optimization agent to interact with continuous-action simulation environments. The architecture follows the Actor-Critic paradigm, where two separate networks share no parameters:
- Actor (Policy) Network: Maps observations to a probability distribution over continuous actions. For continuous control, this is a Gaussian (Normal) distribution parameterized by a learned mean vector and a state-independent log-standard-deviation.
- Critic (Value) Network: Maps observations to a scalar estimate of the state value function V(s), used for computing advantages during policy optimization.
Both networks are multi-layer perceptrons (MLPs) with identical hidden layer structure but separate parameters. This separation allows the critic to learn value estimates without interfering with policy gradient updates, and vice versa.
Key architectural choices include:
- Separate actor and critic networks: Unlike shared-backbone architectures, separate networks avoid gradient interference between policy and value objectives. This is particularly important in continuous control where the value function and policy can have very different optimization landscapes.
- Learned state-independent log-standard-deviation: The action distribution standard deviation is parameterized as a learnable `nn.Parameter` rather than being output by the network. This means the exploration noise is the same regardless of the current state, which simplifies optimization and is empirically effective for many continuous control tasks. The parameter is initialized at -0.5, corresponding to a standard deviation of approximately 0.607.
- Orthogonal weight initialization: All linear layers are initialized using orthogonal initialization with a gain of sqrt(2) for hidden layers. The final actor layer uses a much smaller gain (0.01 * sqrt(2)) to produce near-zero initial actions, promoting cautious initial exploration. Biases are initialized to zero.
- Tanh activations: Hidden layers use the hyperbolic tangent activation function, which bounds outputs to [-1, 1]. This is a classical choice for RL policy networks that helps with gradient flow and is compatible with orthogonal initialization.
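The choices above can be combined into a minimal PyTorch sketch. The hidden sizes (256) and gains follow the component table later in this page; `obs_dim` and `act_dim` are placeholders for the environment's observation and action dimensions.

```python
import numpy as np
import torch
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights with the given gain, zero biases
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # Critic: separate MLP ending in a scalar V(s); final gain 1.0
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 1), std=1.0),
        )
        # Actor mean: same hidden structure, separate parameters;
        # small final gain for near-zero initial actions
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, act_dim), std=0.01 * np.sqrt(2)),
        )
        # State-independent log-std, initialized at -0.5 (sigma ~ 0.607)
        self.actor_logstd = nn.Parameter(torch.full((1, act_dim), -0.5))
```

Note that `critic` and `actor_mean` share no modules, so value-function gradients never touch policy parameters.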
Usage
Use this architecture when:
- Training PPO agents on continuous-action ManiSkill environments with state-based observations
- The observation space is a flat vector (e.g., `obs_mode="state"`)
- The action space is a continuous `Box` space (e.g., joint position deltas)
For visual observations (RGBD, pointcloud), the MLP architecture would need to be replaced with CNN or ViT-based encoders, while the overall actor-critic structure remains the same.
Theoretical Basis
Actor-Critic Methods: The actor-critic framework combines policy gradient methods (actor) with value function approximation (critic). The actor produces actions, while the critic evaluates the quality of states. The critic's value estimates are used to compute advantages, which reduce variance in policy gradient estimates compared to Monte Carlo returns.
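As an illustration of how the critic's estimates enter the policy gradient, the one-step advantage is the temporal-difference error; the numbers below are placeholders, and practical PPO implementations typically aggregate these one-step terms (e.g., via generalized advantage estimation).

```python
import torch

# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed from critic outputs
rewards = torch.tensor([1.0, 0.5])      # placeholder rewards
values = torch.tensor([0.9, 0.8])       # critic estimates V(s_t)
next_values = torch.tensor([0.8, 0.0])  # critic estimates V(s_{t+1})
gamma = 0.99
advantages = rewards + gamma * next_values - values
```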
Gaussian Policies for Continuous Control: In continuous action spaces, the policy is represented as a multivariate Gaussian distribution. Given an observation x:
- The actor network outputs the mean vector: `mu = actor_mean(x)`
- The log-standard-deviation is a learned parameter: `log_sigma`
- The standard deviation is: `sigma = exp(log_sigma)`
- Actions are sampled as: `a ~ N(mu, diag(sigma^2))`
The diagonal covariance assumption (independent action dimensions) simplifies computation while being sufficient for most robotics tasks.
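The sampling procedure above can be sketched with `torch.distributions.Normal`, which models the diagonal covariance as one independent Normal per action dimension; the mean values here are placeholders.

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(4)                  # actor_mean(x), placeholder output
log_sigma = torch.full((4,), -0.5)   # learned log-std parameter
sigma = log_sigma.exp()              # sigma = exp(log_sigma) ~ 0.607
dist = Normal(mu, sigma)             # diagonal Gaussian over actions
action = dist.sample()               # a ~ N(mu, diag(sigma^2))
```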
Log-Probability and Entropy: For the PPO loss computation:
- Log-probability: `log pi(a|s) = sum_i log N(a_i; mu_i, sigma_i)` (summed across action dimensions)
- Entropy: `H(pi(.|s)) = sum_i 0.5 * log(2*pi*e*sigma_i^2)` (used as an exploration bonus)
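Both quantities can be computed from the per-dimension Normals and summed across the action dimension; the sketch below also checks the library entropy against the closed form above.

```python
import math
import torch
from torch.distributions import Normal

mu = torch.zeros(4)                        # placeholder action mean
sigma = torch.full((4,), -0.5).exp()       # exp(log_sigma)
dist = Normal(mu, sigma)
a = dist.sample()

# Sum the independent per-dimension terms over the action dimension
log_prob = dist.log_prob(a).sum(-1)        # log pi(a|s)
entropy = dist.entropy().sum(-1)           # H(pi(.|s))

# Closed form: sum_i 0.5 * log(2*pi*e*sigma_i^2)
closed = (0.5 * torch.log(2 * math.pi * math.e * sigma**2)).sum(-1)
```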
Orthogonal Initialization (Saxe et al., 2014): Initializing weight matrices as orthogonal matrices preserves the norm of activations through the network, preventing vanishing or exploding gradients at the start of training. The gain factor controls the scale:
- Hidden layers: `gain = sqrt(2)` (standard for Tanh activations)
- Output layer (actor): `gain = 0.01 * sqrt(2)` (small initial actions for cautious exploration)
- Output layer (critic): `gain = 1.0` (default for value outputs)
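These gains map directly onto `nn.init.orthogonal_`; the layer sizes here are placeholders (`act_dim = 8`), and the final line verifies the norm-preservation property: for a square orthogonal matrix scaled by gain g, the Gram matrix W @ W.T is g^2 times the identity.

```python
import math
import torch
import torch.nn as nn

hidden = nn.Linear(256, 256)
nn.init.orthogonal_(hidden.weight, gain=math.sqrt(2))   # hidden layers
nn.init.constant_(hidden.bias, 0.0)                     # zero biases

actor_out = nn.Linear(256, 8)                           # act_dim = 8, placeholder
nn.init.orthogonal_(actor_out.weight, gain=0.01 * math.sqrt(2))

critic_out = nn.Linear(256, 1)
nn.init.orthogonal_(critic_out.weight, gain=1.0)

# Norm preservation: W @ W.T should be close to gain^2 * I = 2 * I
gram = hidden.weight.detach() @ hidden.weight.detach().T
```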
| Component | Architecture | Input | Output | Initialization |
|---|---|---|---|---|
| Critic | 4-layer MLP (in->256->256->256->1) with Tanh | obs vector (obs_dim,) | scalar value V(s) | Orthogonal, gain=sqrt(2); final gain=1.0 |
| Actor Mean | 4-layer MLP (in->256->256->256->act_dim) with Tanh | obs vector (obs_dim,) | action mean vector (act_dim,) | Orthogonal, gain=sqrt(2); final gain=0.01*sqrt(2) |
| Actor Log-Std | nn.Parameter | N/A (learned constant) | log-std vector (1, act_dim) | Initialized at -0.5 |
Related Pages
- Implementation:Haosulab_ManiSkill_PPO_Agent_Network -- Concrete PyTorch implementation of this architecture
- Principle:Haosulab_ManiSkill_PPO_Policy_Optimization -- How this architecture is trained via the PPO algorithm
- Principle:Haosulab_ManiSkill_GPU_Parallelized_Rollout -- How the agent collects experience from parallel environments