Implementation:Haosulab ManiSkill PPO Agent Network
| Field | Value |
|---|---|
| implementation_name | Haosulab_ManiSkill_PPO_Agent_Network |
| overview | Concrete PPO actor-critic neural network for ManiSkill environments with separate actor and critic MLPs |
| type | Pattern Doc |
| domains | Reinforcement_Learning, Robotics |
| last_updated | 2026-02-15 |
| related_pages | Principle:Haosulab_ManiSkill_PPO_Agent_Architecture |
Overview
Description
The Agent class is a PyTorch nn.Module that implements the actor-critic architecture for PPO training on ManiSkill environments. It consists of a critic network (3 hidden layers of 256 units with Tanh activation), an actor mean network (same structure), and a learned log-standard-deviation parameter. All layers use orthogonal initialization via the layer_init helper function.
This is a Pattern Doc -- it documents a user-defined component from the PPO example baseline, not a library API. Users are expected to copy and modify this code for their specific needs.
Usage
Instantiate the Agent by passing the wrapped vectorized environment (which provides single_observation_space and single_action_space). The agent is then moved to the training device and used during rollout collection and policy optimization.
Code Reference
| Field | Value |
|---|---|
| Repository | https://github.com/haosulab/ManiSkill |
| File | examples/baselines/ppo/ppo.py (lines 115-161) |
Helper function for orthogonal initialization:
```python
def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer
```
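As a quick illustrative check (not part of the original file): orthogonal initialization with gain `sqrt(2)` produces mutually orthogonal weight rows with squared norm 2, so for a square layer `W @ W.T` equals `2 * I` and the bias is zero.

```python
import numpy as np
import torch
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

torch.manual_seed(0)
layer = layer_init(nn.Linear(4, 4))
W = layer.weight
# For a square weight, orthogonal init scaled by sqrt(2) gives W @ W.T = 2 * I
print(torch.allclose(W @ W.T, 2.0 * torch.eye(4), atol=1e-5))  # True
print(torch.allclose(layer.bias, torch.zeros(4)))              # True
```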
Agent class:
```python
class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 1)),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, np.prod(envs.single_action_space.shape)), std=0.01 * np.sqrt(2)),
        )
        self.actor_logstd = nn.Parameter(
            torch.ones(1, np.prod(envs.single_action_space.shape)) * -0.5
        )

    def get_value(self, x) -> torch.Tensor:
        return self.critic(x)

    def get_action(self, x, deterministic=False) -> torch.Tensor:
        action_mean = self.actor_mean(x)
        if deterministic:
            return action_mean
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        return probs.sample()

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)
```
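One design point worth making concrete: `actor_logstd` is a free parameter, not a network output, so the exploration scale is state-independent. It is initialized to -0.5 in every action dimension, which fixes the policy's initial per-dimension standard deviation at `exp(-0.5)`:

```python
import torch

# actor_logstd starts at -0.5, so the initial per-dimension
# exploration std of the Gaussian policy is exp(-0.5)
init_std = torch.exp(torch.tensor(-0.5)).item()
print(f"{init_std:.4f}")  # 0.6065
```

Combined with the small gain (`std=0.01 * np.sqrt(2)`) on the actor's output layer, this yields near-zero initial action means with moderate exploration noise.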
I/O Contract
Constructor:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | envs | ManiSkillVectorEnv | Wrapped vectorized environment providing single_observation_space and single_action_space |
| Output | agent | Agent (nn.Module) | Actor-critic model ready for .to(device) and training |
get_value(x):
| Direction | Name | Type | Shape | Description |
|---|---|---|---|---|
| Input | x | torch.Tensor | (batch, obs_dim) | Observation batch |
| Output | value | torch.Tensor | (batch, 1) | State value estimates V(s) |
get_action(x, deterministic):
| Direction | Name | Type | Shape | Description |
|---|---|---|---|---|
| Input | x | torch.Tensor | (batch, obs_dim) | Observation batch |
| Input | deterministic | bool | scalar | If True, returns the mean action (no sampling) |
| Output | action | torch.Tensor | (batch, act_dim) | Sampled or deterministic actions |
get_action_and_value(x, action):
| Direction | Name | Type | Shape | Description |
|---|---|---|---|---|
| Input | x | torch.Tensor | (batch, obs_dim) | Observation batch |
| Input | action | Optional[torch.Tensor] | (batch, act_dim) | If None, a new action is sampled; otherwise, evaluates the given action |
| Output | action | torch.Tensor | (batch, act_dim) | Sampled or provided action |
| Output | log_prob | torch.Tensor | (batch,) | Log-probability of the action under the current policy (summed over dimensions) |
| Output | entropy | torch.Tensor | (batch,) | Entropy of the action distribution (summed over dimensions) |
| Output | value | torch.Tensor | (batch, 1) | State value estimate V(s) |
Usage Examples
Example 1: Instantiate the agent and move to GPU
```python
import torch
import numpy as np
import torch.nn as nn
from torch.distributions.normal import Normal

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assume envs is already a ManiSkillVectorEnv
agent = Agent(envs).to(device)
optimizer = torch.optim.Adam(agent.parameters(), lr=3e-4, eps=1e-5)
```
Example 2: Collect an action during rollout
```python
with torch.no_grad():
    action, logprob, _, value = agent.get_action_and_value(next_obs)
# action: (num_envs, act_dim) - sampled from the Gaussian policy
# logprob: (num_envs,) - log-probability for the PPO ratio computation
# value: (num_envs, 1) - critic estimate for GAE computation
```
Example 3: Evaluate actions during PPO update
```python
# During minibatch optimization, re-evaluate stored actions
_, newlogprob, entropy, newvalue = agent.get_action_and_value(
    b_obs[mb_inds],      # minibatch observations
    b_actions[mb_inds],  # minibatch actions (previously collected)
)
# newlogprob: log-probability under CURRENT policy (for importance ratio)
# entropy: for entropy bonus in loss
# newvalue: updated value estimate for value loss
```
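A minimal sketch of how the re-evaluated log-probabilities feed PPO's clipped surrogate objective. The names b_logprobs, mb_advantages, and clip_coef are assumptions standing in for tensors from the training loop, with toy values here:

```python
import torch

# Toy stand-ins for stored rollout data (hypothetical values)
b_logprobs = torch.tensor([-1.2, -0.8, -2.0])   # log-probs at collection time
newlogprob = torch.tensor([-1.0, -0.9, -1.5])   # log-probs under current policy
mb_advantages = torch.tensor([0.5, -0.3, 1.0])
clip_coef = 0.2

# Importance ratio pi_new(a|s) / pi_old(a|s)
ratio = (newlogprob - b_logprobs).exp()

# Clipped surrogate policy loss
pg_loss1 = -mb_advantages * ratio
pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
pg_loss = torch.max(pg_loss1, pg_loss2).mean()
print(pg_loss.ndim)  # 0 -- a scalar ready for backprop
```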
Example 4: Deterministic evaluation
```python
agent.eval()
with torch.no_grad():
    # Deterministic action = mean of the Gaussian (no sampling noise)
    eval_action = agent.get_action(eval_obs, deterministic=True)
    eval_obs, eval_rew, _, _, eval_info = eval_envs.step(eval_action)
```
Related Pages
- Principle:Haosulab_ManiSkill_PPO_Agent_Architecture -- The principle this implementation realizes
- Implementation:Haosulab_ManiSkill_PPO_Training_Loop -- How this agent is trained via PPO
- Implementation:Haosulab_ManiSkill_PPO_Eval_Loop -- How this agent is evaluated during training