
Implementation:Haosulab ManiSkill PPO Agent Network

From Leeroopedia
Field Value
implementation_name Haosulab_ManiSkill_PPO_Agent_Network
overview Concrete PPO actor-critic neural network for ManiSkill environments with separate actor and critic MLPs
type Pattern Doc
domains Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Principle:Haosulab_ManiSkill_PPO_Agent_Architecture

Overview

Description

The Agent class is a PyTorch nn.Module implementing the actor-critic architecture used for PPO training on ManiSkill environments. It consists of a critic network (three hidden layers of 256 units with Tanh activations), an actor mean network (the same hidden structure, but with the output layer initialized at a smaller std of 0.01*sqrt(2) so initial actions stay close to zero), and a learned log-standard-deviation parameter initialized to -0.5. All layers use orthogonal initialization via the layer_init helper function.

This is a Pattern Doc -- it documents a user-defined component from the PPO example baseline, not a library API. Users are expected to copy and modify this code for their specific needs.

Usage

Instantiate the Agent by passing the wrapped vectorized environment (which provides single_observation_space and single_action_space). The agent is then moved to the training device and used during rollout collection and policy optimization.

Code Reference

Field Value
Repository https://github.com/haosulab/ManiSkill
File examples/baselines/ppo/ppo.py (lines 115-161)

Helper function for orthogonal initialization:

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight init (std is passed as the gain) and constant bias
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer
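
For intuition, orthogonal initialization draws a weight matrix with orthonormal rows or columns and then scales it by std. The following is a minimal NumPy sketch of the same idea via QR decomposition (my own illustration of the property, not the torch.nn.init.orthogonal_ implementation):

```python
import numpy as np

def orthogonal_init(rows, cols, std=np.sqrt(2), seed=0):
    """NumPy sketch of orthogonal initialization: sample a Gaussian matrix,
    orthonormalize it with QR, then scale by std (the gain)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    # Sign correction so the result is uniformly distributed over orthogonal matrices
    q *= np.sign(np.diag(r))
    if rows < cols:
        q = q.T
    return std * q[:rows, :cols]

W = orthogonal_init(4, 4)
# Columns are orthonormal up to the gain: W.T @ W ≈ std**2 * I
```

The gain of sqrt(2) compensates for variance lost through the nonlinearity; the actor's output layer uses a much smaller gain (0.01*sqrt(2) in the code below) so the initial policy mean stays near zero.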

Agent class:

class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 1)),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, np.prod(envs.single_action_space.shape)), std=0.01*np.sqrt(2)),
        )
        self.actor_logstd = nn.Parameter(
            torch.ones(1, np.prod(envs.single_action_space.shape)) * -0.5
        )

    def get_value(self, x) -> torch.Tensor:
        return self.critic(x)

    def get_action(self, x, deterministic=False) -> torch.Tensor:
        action_mean = self.actor_mean(x)
        if deterministic:
            return action_mean
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        return probs.sample()

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)

I/O Contract

Constructor:

Direction Name Type Description
Input envs ManiSkillVectorEnv Wrapped vectorized environment providing single_observation_space and single_action_space
Output agent Agent (nn.Module) Actor-critic model ready for .to(device) and training

get_value(x):

Direction Name Type Shape Description
Input x torch.Tensor (batch, obs_dim) Observation batch
Output value torch.Tensor (batch, 1) State value estimates V(s)

get_action(x, deterministic):

Direction Name Type Shape Description
Input x torch.Tensor (batch, obs_dim) Observation batch
Input deterministic bool scalar If True, returns the mean action (no sampling)
Output action torch.Tensor (batch, act_dim) Sampled or deterministic actions

get_action_and_value(x, action):

Direction Name Type Shape Description
Input x torch.Tensor (batch, obs_dim) Observation batch
Input action Optional[torch.Tensor] (batch, act_dim) If None, a new action is sampled; otherwise, evaluates the given action
Output action torch.Tensor (batch, act_dim) Sampled or provided action
Output log_prob torch.Tensor (batch,) Log-probability of the action under the current policy (summed over dimensions)
Output entropy torch.Tensor (batch,) Entropy of the action distribution (summed over dimensions)
Output value torch.Tensor (batch, 1) State value estimate V(s)
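
The log_prob and entropy outputs follow the standard diagonal-Gaussian formulas, summed over action dimensions. A small NumPy check of the math (my own sketch, not code from the repository) is:

```python
import numpy as np

def gaussian_logprob_and_entropy(action, mean, logstd):
    """Per-sample log-probability and entropy of a diagonal Gaussian,
    summed over action dimensions -- the quantities that
    Normal(mean, std).log_prob(action).sum(1) and .entropy().sum(1) return."""
    std = np.exp(logstd)
    log_prob = (-0.5 * ((action - mean) / std) ** 2
                - logstd
                - 0.5 * np.log(2 * np.pi)).sum(axis=1)
    entropy = (0.5 + 0.5 * np.log(2 * np.pi) + logstd).sum(axis=1)
    return log_prob, entropy

mean = np.zeros((1, 3))
logstd = np.full((1, 3), -0.5)  # matches the agent's actor_logstd init
lp, ent = gaussian_logprob_and_entropy(mean, mean, logstd)
# At the mean, each dimension contributes -logstd - 0.5*log(2*pi) to lp
```

Note that both quantities are reduced over the action dimension, which is why the contract lists shape (batch,) rather than (batch, act_dim).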

Usage Examples

Example 1: Instantiate the agent and move to GPU

import torch
import numpy as np
import torch.nn as nn
from torch.distributions.normal import Normal

# Assume envs is already a ManiSkillVectorEnv
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = Agent(envs).to(device)
optimizer = torch.optim.Adam(agent.parameters(), lr=3e-4, eps=1e-5)

Example 2: Collect an action during rollout

with torch.no_grad():
    action, logprob, _, value = agent.get_action_and_value(next_obs)
    # action: (num_envs, act_dim) - sampled from Gaussian
    # logprob: (num_envs,) - log-probability for PPO ratio computation
    # value: (num_envs, 1) - critic estimate for GAE computation

Example 3: Evaluate actions during PPO update

# During minibatch optimization, re-evaluate stored actions
_, newlogprob, entropy, newvalue = agent.get_action_and_value(
    b_obs[mb_inds],      # minibatch observations
    b_actions[mb_inds],  # minibatch actions (previously collected)
)
# newlogprob: log-probability under CURRENT policy (for importance ratio)
# entropy: for entropy bonus in loss
# newvalue: updated value estimate for value loss
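
For context, newlogprob feeds the PPO importance ratio and clipped surrogate objective. A minimal NumPy sketch of that standard computation (illustration only; clip_coef is an assumed hyperparameter name, not taken from this file):

```python
import numpy as np

def ppo_clip_objective(new_logprob, old_logprob, advantages, clip_coef=0.2):
    """Clipped PPO surrogate (to be maximized): the minibatch mean of
    min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = np.exp(new_logprob - old_logprob)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    return np.minimum(unclipped, clipped).mean()

# Before any update, new and old logprobs coincide, so ratio == 1
# and the objective reduces to the mean advantage.
adv = np.array([1.0, -2.0, 0.5])
lp = np.log(np.array([0.3, 0.2, 0.5]))
obj = ppo_clip_objective(lp, lp, adv)
```

The clipping is what makes re-evaluating stored actions under the current policy (as in Example 3) necessary at every minibatch step.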

Example 4: Deterministic evaluation

agent.eval()
with torch.no_grad():
    # Deterministic action = mean of the Gaussian (no sampling noise)
    eval_action = agent.get_action(eval_obs, deterministic=True)
    eval_obs, eval_rew, _, _, eval_info = eval_envs.step(eval_action)

Related Pages

Principle:Haosulab_ManiSkill_PPO_Agent_Architecture