
Implementation:Haosulab ManiSkill PPO Agent Network

From Leeroopedia
Field Value
implementation_name Haosulab_ManiSkill_PPO_Agent_Network
overview Concrete PPO actor-critic neural network for ManiSkill environments with separate actor and critic MLPs
type Pattern Doc
domains Reinforcement_Learning, Robotics
last_updated 2026-02-15
related_pages Principle:Haosulab_ManiSkill_PPO_Agent_Architecture

Overview

Description

The Agent class is a PyTorch nn.Module implementing the actor-critic architecture used for PPO training on ManiSkill environments. It consists of a critic network (three hidden layers of 256 units with Tanh activations), an actor mean network (the same hidden structure, but with the output layer initialized at a smaller std of 0.01*sqrt(2) so initial actions stay close to zero), and a learned log-standard-deviation parameter initialized to -0.5. All layers use orthogonal initialization via the layer_init helper function.

This is a Pattern Doc -- it documents a user-defined component from the PPO example baseline, not a library API. Users are expected to copy and modify this code for their specific needs.

Usage

Instantiate the Agent by passing the wrapped vectorized environment (which provides single_observation_space and single_action_space). The agent is then moved to the training device and used during rollout collection and policy optimization.

Code Reference

Field Value
Repository https://github.com/haosulab/ManiSkill
File examples/baselines/ppo/ppo.py (lines 115-161)

Helper function for orthogonal initialization:

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight init (std is passed as the gain) and constant bias
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer
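
For intuition, orthogonal initialization draws a weight matrix with orthonormal rows or columns and then scales it by std. The following is a minimal NumPy sketch of the same idea via QR decomposition (my own illustration of the property, not the torch.nn.init.orthogonal_ implementation):

```python
import numpy as np

def orthogonal_init(rows, cols, std=np.sqrt(2), seed=0):
    """NumPy sketch of orthogonal initialization: sample a Gaussian matrix,
    orthonormalize it with QR, then scale by std (the gain)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    # Sign correction so the result is uniformly distributed over orthogonal matrices
    q *= np.sign(np.diag(r))
    if rows < cols:
        q = q.T
    return std * q[:rows, :cols]

W = orthogonal_init(4, 4)
# Columns are orthonormal up to the gain: W.T @ W ≈ std**2 * I
```

The gain of sqrt(2) compensates for variance lost through the nonlinearity; the actor's output layer uses a much smaller gain (0.01*sqrt(2) in the code below) so the initial policy mean stays near zero.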

Agent class:

class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 1)),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, 256)),
            nn.Tanh(),
            layer_init(nn.Linear(256, np.prod(envs.single_action_space.shape)), std=0.01*np.sqrt(2)),
        )
        self.actor_logstd = nn.Parameter(
            torch.ones(1, np.prod(envs.single_action_space.shape)) * -0.5
        )

    def get_value(self, x) -> torch.Tensor:
        return self.critic(x)

    def get_action(self, x, deterministic=False) -> torch.Tensor:
        action_mean = self.actor_mean(x)
        if deterministic:
            return action_mean
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        return probs.sample()

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)

I/O Contract

Constructor:

Direction Name Type Description
Input envs ManiSkillVectorEnv Wrapped vectorized environment providing single_observation_space and single_action_space
Output agent Agent (nn.Module) Actor-critic model ready for .to(device) and training

get_value(x):

Direction Name Type Shape Description
Input x torch.Tensor (batch, obs_dim) Observation batch
Output value torch.Tensor (batch, 1) State value estimates V(s)

get_action(x, deterministic):

Direction Name Type Shape Description
Input x torch.Tensor (batch, obs_dim) Observation batch
Input deterministic bool scalar If True, returns the mean action (no sampling)
Output action torch.Tensor (batch, act_dim) Sampled or deterministic actions

get_action_and_value(x, action):

Direction Name Type Shape Description
Input x torch.Tensor (batch, obs_dim) Observation batch
Input action Optional[torch.Tensor] (batch, act_dim) If None, a new action is sampled; otherwise, evaluates the given action
Output action torch.Tensor (batch, act_dim) Sampled or provided action
Output log_prob torch.Tensor (batch,) Log-probability of the action under the current policy (summed over dimensions)
Output entropy torch.Tensor (batch,) Entropy of the action distribution (summed over dimensions)
Output value torch.Tensor (batch, 1) State value estimate V(s)
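
The log_prob and entropy outputs follow the standard diagonal-Gaussian formulas, summed over action dimensions. A small NumPy check of the math (my own sketch, not code from the repository) is:

```python
import numpy as np

def gaussian_logprob_and_entropy(action, mean, logstd):
    """Per-sample log-probability and entropy of a diagonal Gaussian,
    summed over action dimensions -- the quantities that
    Normal(mean, std).log_prob(action).sum(1) and .entropy().sum(1) return."""
    std = np.exp(logstd)
    log_prob = (-0.5 * ((action - mean) / std) ** 2
                - logstd
                - 0.5 * np.log(2 * np.pi)).sum(axis=1)
    entropy = (0.5 + 0.5 * np.log(2 * np.pi) + logstd).sum(axis=1)
    return log_prob, entropy

mean = np.zeros((1, 3))
logstd = np.full((1, 3), -0.5)  # matches the agent's actor_logstd init
lp, ent = gaussian_logprob_and_entropy(mean, mean, logstd)
# At the mean, each dimension contributes -logstd - 0.5*log(2*pi) to lp
```

Note that both quantities are reduced over the action dimension, which is why the contract lists shape (batch,) rather than (batch, act_dim).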

Usage Examples

Example 1: Instantiate the agent and move to GPU

import torch
import numpy as np
import torch.nn as nn
from torch.distributions.normal import Normal

# Assume envs is already a ManiSkillVectorEnv
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = Agent(envs).to(device)
optimizer = torch.optim.Adam(agent.parameters(), lr=3e-4, eps=1e-5)

Example 2: Collect an action during rollout

with torch.no_grad():
    action, logprob, _, value = agent.get_action_and_value(next_obs)
    # action: (num_envs, act_dim) - sampled from Gaussian
    # logprob: (num_envs,) - log-probability for PPO ratio computation
    # value: (num_envs, 1) - critic estimate for GAE computation

Example 3: Evaluate actions during PPO update

# During minibatch optimization, re-evaluate stored actions
_, newlogprob, entropy, newvalue = agent.get_action_and_value(
    b_obs[mb_inds],      # minibatch observations
    b_actions[mb_inds],  # minibatch actions (previously collected)
)
# newlogprob: log-probability under CURRENT policy (for importance ratio)
# entropy: for entropy bonus in loss
# newvalue: updated value estimate for value loss
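
For context, newlogprob feeds the PPO importance ratio and clipped surrogate objective. A minimal NumPy sketch of that standard computation (illustration only; clip_coef is an assumed hyperparameter name, not taken from this file):

```python
import numpy as np

def ppo_clip_objective(new_logprob, old_logprob, advantages, clip_coef=0.2):
    """Clipped PPO surrogate (to be maximized): the minibatch mean of
    min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = np.exp(new_logprob - old_logprob)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    return np.minimum(unclipped, clipped).mean()

# Before any update, new and old logprobs coincide, so ratio == 1
# and the objective reduces to the mean advantage.
adv = np.array([1.0, -2.0, 0.5])
lp = np.log(np.array([0.3, 0.2, 0.5]))
obj = ppo_clip_objective(lp, lp, adv)
```

The clipping is what makes re-evaluating stored actions under the current policy (as in Example 3) necessary at every minibatch step.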

Example 4: Deterministic evaluation

agent.eval()
with torch.no_grad():
    # Deterministic action = mean of the Gaussian (no sampling noise)
    eval_action = agent.get_action(eval_obs, deterministic=True)
    eval_obs, eval_rew, _, _, eval_info = eval_envs.step(eval_action)

Related Pages

Principle:Haosulab_ManiSkill_PPO_Agent_Architecture