Principle:Haosulab ManiSkill PPO Agent Architecture

From Leeroopedia
Field Value
principle_name Haosulab_ManiSkill_PPO_Agent_Architecture
overview Actor-Critic neural network architecture for Proximal Policy Optimization in continuous action spaces
domains Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Implementation:Haosulab_ManiSkill_PPO_Agent_Network

Overview

Description

The PPO Agent Architecture defines the neural network structure used by a Proximal Policy Optimization agent to interact with continuous-action simulation environments. The architecture follows the Actor-Critic paradigm, where two separate networks share no parameters:

  • Actor (Policy) Network: Maps observations to a probability distribution over continuous actions. For continuous control, this is a Gaussian (Normal) distribution parameterized by a learned mean vector and a state-independent log-standard-deviation.
  • Critic (Value) Network: Maps observations to a scalar estimate of the state value function V(s), used for computing advantages during policy optimization.

Both networks are multi-layer perceptrons (MLPs) with identical hidden layer structure but separate parameters. This separation allows the critic to learn value estimates without interfering with policy gradient updates, and vice versa.

Key architectural choices include:

  • Separate actor and critic networks: Unlike shared-backbone architectures, separate networks avoid gradient interference between policy and value objectives. This is particularly important in continuous control where the value function and policy can have very different optimization landscapes.
  • Learned state-independent log-standard-deviation: The action distribution standard deviation is parameterized as a learnable nn.Parameter rather than being output by the network. This means the exploration noise is the same regardless of the current state, which simplifies optimization and is empirically effective for many continuous control tasks. The parameter is initialized at -0.5, corresponding to a standard deviation of approximately 0.607.
  • Orthogonal weight initialization: All linear layers are initialized using orthogonal initialization with a gain of sqrt(2) for hidden layers. The final actor layer uses a much smaller gain (0.01 * sqrt(2)) to produce near-zero initial actions, promoting cautious initial exploration. Biases are initialized to zero.
  • Tanh activations: Hidden layers use the hyperbolic tangent activation function, which bounds outputs to [-1, 1]. This is a classical choice for RL policy networks that helps with gradient flow and is compatible with orthogonal initialization.
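
The choices above can be sketched in PyTorch. This is a minimal illustration, not the project's exact code: `obs_dim` and `act_dim` are placeholder dimensions, and the 256-unit layer widths follow the architecture summary later on this page.

```python
import numpy as np
import torch
import torch.nn as nn


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight initialization with a configurable gain; zero biases.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    """Separate actor/critic MLPs with a state-independent log-std (sketch)."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 1), std=1.0),  # value head: gain 1.0
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            layer_init(nn.Linear(256, 256)), nn.Tanh(),
            # Small final gain -> near-zero initial actions
            layer_init(nn.Linear(256, act_dim), std=0.01 * np.sqrt(2)),
        )
        # State-independent log-std, initialized at -0.5 (sigma ~ 0.607).
        self.actor_logstd = nn.Parameter(torch.ones(1, act_dim) * -0.5)
```

Because the two `nn.Sequential` stacks share no modules, gradients from the value loss never touch the policy parameters, matching the separate-network rationale above.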

Usage

Use this architecture when:

  • Training PPO agents on continuous-action ManiSkill environments with state-based observations
  • The observation space is a flat vector (e.g., obs_mode="state")
  • The action space is a continuous Box space (e.g., joint position deltas)

For visual observations (RGBD, pointcloud), the MLP architecture would need to be replaced with CNN or ViT-based encoders, while the overall actor-critic structure remains the same.

Theoretical Basis

Actor-Critic Methods: The actor-critic framework combines policy gradient methods (actor) with value function approximation (critic). The actor produces actions, while the critic evaluates the quality of states. The critic's value estimates are used to compute advantages, which reduce variance in policy gradient estimates compared to Monte Carlo returns.

Gaussian Policies for Continuous Control: In continuous action spaces, the policy is represented as a multivariate Gaussian distribution. Given an observation x:

  • The actor network outputs the mean vector: mu = actor_mean(x)
  • The log-standard-deviation is a learned parameter: log_sigma
  • The standard deviation is: sigma = exp(log_sigma)
  • Actions are sampled as: a ~ N(mu, diag(sigma^2))

The diagonal covariance assumption (independent action dimensions) simplifies computation while being sufficient for most robotics tasks.
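
The four steps above can be sketched with `torch.distributions`; the tensor values here are illustrative stand-ins for the network outputs (a 4-dimensional action space is assumed).

```python
import torch
from torch.distributions import Normal

# Hypothetical stand-ins for the quantities defined above.
mu = torch.zeros(4)                  # mu = actor_mean(x)
log_sigma = torch.full((4,), -0.5)   # learned parameter, init -0.5
sigma = log_sigma.exp()              # sigma = exp(log_sigma) ~ 0.607

dist = Normal(mu, sigma)             # diagonal Gaussian: independent dims
action = dist.sample()               # a ~ N(mu, diag(sigma^2))
```

`Normal` with vector arguments treats each dimension independently, which is exactly the diagonal-covariance assumption.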

Log-Probability and Entropy: For the PPO loss computation:

  • Log-probability: log pi(a|s) = sum_i log N(a_i; mu_i, sigma_i) (summed across action dimensions)
  • Entropy: H(pi(.|s)) = sum_i (0.5 * log(2*pi*e*sigma_i^2)) (used as an exploration bonus)
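
Both quantities are available from `torch.distributions.Normal`; the sketch below (with illustrative values) sums across action dimensions and compares the library entropy against the closed form above.

```python
import math
import torch
from torch.distributions import Normal

# Illustrative values; in training these come from the actor network.
mu = torch.zeros(4)
sigma = torch.full((4,), 0.5)
dist = Normal(mu, sigma)

action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)  # summed across action dims
entropy = dist.entropy().sum(-1)          # exploration bonus term

# Closed-form diagonal-Gaussian entropy from the formula above:
expected = (0.5 * torch.log(2 * math.pi * math.e * sigma**2)).sum(-1)
```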

Orthogonal Initialization (Saxe et al., 2014): Initializing weight matrices as orthogonal matrices preserves the norm of activations through the network, preventing vanishing or exploding gradients at the start of training. The gain factor controls the scale:

  • Hidden layers: gain = sqrt(2) (the conventional choice in PPO implementations, used here with Tanh activations)
  • Output layer (actor): gain = 0.01 * sqrt(2) (small initial actions for cautious exploration)
  • Output layer (critic): gain = 1.0 (default for value outputs)
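
The norm-preservation property can be checked directly. A minimal sketch with a square weight matrix (gain = 1.0 chosen for illustration):

```python
import torch

# Orthogonal matrices preserve vector norms, so activations keep their
# scale through a freshly initialized layer.
w = torch.empty(256, 256)
torch.nn.init.orthogonal_(w, gain=1.0)

x = torch.randn(256)
assert torch.allclose((w @ x).norm(), x.norm(), atol=1e-4)
```
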

Network Architecture Summary

Component     | Architecture                                       | Input                  | Output                        | Initialization
Critic        | 4-layer MLP (in->256->256->256->1) with Tanh       | obs vector (obs_dim,)  | scalar value V(s)             | Orthogonal, gain=sqrt(2); final gain=1.0
Actor Mean    | 4-layer MLP (in->256->256->256->act_dim) with Tanh | obs vector (obs_dim,)  | action mean vector (act_dim,) | Orthogonal, gain=sqrt(2); final gain=0.01*sqrt(2)
Actor Log-Std | nn.Parameter                                       | N/A (learned constant) | log-std vector (1, act_dim)   | Initialized at -0.5

Related Pages

  • Implementation:Haosulab_ManiSkill_PPO_Agent_Network