
Implementation:Haosulab ManiSkill Evaluate Dense Reward

From Leeroopedia
Page Type: Implementation (Pattern Doc)
Title: ManiSkill evaluate() and compute_dense_reward()
Domain: Simulation, Robotics, Environment_Design, Reinforcement_Learning
Related Principle: Principle:Haosulab_ManiSkill_Reward_Success_Design
Source File: mani_skill/envs/sapien_env.py (L1134-1144 evaluate, L698-720 compute_dense_reward)
Date: 2026-02-15
Repository: Haosulab/ManiSkill

Overview

Description

This document describes the concrete interfaces for task evaluation and reward computation in ManiSkill:

  • evaluate(): Returns a dictionary containing at minimum a "success" boolean tensor, optionally a "fail" boolean tensor, and any additional intermediate data useful for observations and rewards.
  • compute_dense_reward(): Returns a float tensor of shape (num_envs,) representing the dense reward for the current step.
  • compute_normalized_dense_reward(): Returns the dense reward normalized to the [0, 1] range.
  • compute_sparse_reward(): Default implementation returns +1 for success, -1 for failure, 0 otherwise. Can be overridden.
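Taken together, these overrides fit into a subclass as sketched below. The class name is hypothetical and stands in for a real BaseEnv subclass; the only assumptions are that the environment exposes num_envs and device, as BaseEnv does.

```python
import torch

class MinimalTaskSketch:
    """Hypothetical stand-in for a BaseEnv subclass; illustrative only."""

    num_envs = 4
    device = "cpu"

    def evaluate(self) -> dict:
        # A real task would inspect simulation state here; a fixed
        # per-env success mask is hard-coded for illustration.
        success = torch.tensor([True, False, False, True], device=self.device)
        return {"success": success}

    def compute_dense_reward(self, obs, action, info):
        # Shaped reward, overridden to the maximum (1.0) on success.
        reward = torch.full((self.num_envs,), 0.25, device=self.device)
        reward[info["success"]] = 1.0
        return reward

    def compute_normalized_dense_reward(self, obs, action, info):
        # Divide by the maximum achievable dense reward (1.0 here).
        return self.compute_dense_reward(obs, action, info) / 1.0
```

Note how compute_dense_reward consumes the dict returned by evaluate() via the info argument, and the normalized variant simply rescales the dense reward by its maximum.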

Usage

Override these methods in your BaseEnv subclass. The reward mode is selected at environment creation and determines which method is called during step().
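The selection can be pictured with the following simplified, torch-free sketch. The class and stub rewards are hypothetical; the real dispatch lives inside BaseEnv's step/reward logic in sapien_env.py.

```python
class RewardModeDispatchSketch:
    """Simplified sketch of reward-mode dispatch (not the real BaseEnv)."""

    def __init__(self, reward_mode: str = "dense"):
        self.reward_mode = reward_mode

    # Stub rewards standing in for the overridable methods.
    def compute_sparse_reward(self, obs, action, info):
        return 1.0 if info.get("success") else 0.0

    def compute_dense_reward(self, obs, action, info):
        return 0.5

    def compute_normalized_dense_reward(self, obs, action, info):
        # Assume a maximum dense reward of 4.0 for this sketch.
        return self.compute_dense_reward(obs, action, info) / 4.0

    def get_reward(self, obs, action, info):
        # Called once per step(); routes to the method matching reward_mode.
        if self.reward_mode == "sparse":
            return self.compute_sparse_reward(obs, action, info)
        if self.reward_mode == "dense":
            return self.compute_dense_reward(obs, action, info)
        if self.reward_mode == "normalized_dense":
            return self.compute_normalized_dense_reward(obs, action, info)
        raise ValueError(f"Unsupported reward mode: {self.reward_mode}")
```

For example, constructing the sketch with reward_mode="sparse" routes every step's reward through compute_sparse_reward, mirroring how the reward mode chosen at environment creation fixes the code path for the whole episode.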

Code Reference

evaluate() (sapien_env.py L1134-1144)

def evaluate(self) -> dict:
    """
    Evaluate whether the environment is currently in a success state
    by returning a dictionary with a "success" key or a failure state
    via a "fail" key.

    This function may also return additional data that has been computed
    (e.g. is the robot grasping some object) that may be reused when
    generating observations and rewards.

    By default if not overridden this function returns an empty dictionary.
    """
    return dict()

compute_dense_reward() (sapien_env.py L698-707)

def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    """
    Compute the dense reward.

    Args:
        obs (Any): The observation data. By default the observation data
            will be in its most raw form, a dictionary (no flattening,
            wrappers etc.)
        action (torch.Tensor): The most recent action.
        info (dict): The info dictionary (output of evaluate()).

    Returns:
        torch.Tensor: Reward tensor of shape (num_envs,).

    Raises:
        NotImplementedError if not overridden.
    """
    raise NotImplementedError()

compute_normalized_dense_reward() (sapien_env.py L709-720)

def compute_normalized_dense_reward(
    self, obs: Any, action: torch.Tensor, info: dict
):
    """
    Compute the normalized dense reward (expected range [0, 1]).

    Args:
        obs (Any): The observation data.
        action (torch.Tensor): The most recent action.
        info (dict): The info dictionary.

    Returns:
        torch.Tensor: Normalized reward tensor of shape (num_envs,).

    Raises:
        NotImplementedError if not overridden.
    """
    raise NotImplementedError()

compute_sparse_reward() (sapien_env.py L672-696)

def compute_sparse_reward(self, obs: Any, action: torch.Tensor, info: dict):
    """
    Default sparse reward: +1 for success, -1 for fail, 0 otherwise.
    Uses info["success"] and info["fail"] if present.

    Returns:
        torch.Tensor: Sparse reward of shape (num_envs,).
    """
    # Body sketched from the documented behavior; see sapien_env.py
    # L672-696 for the exact implementation.
    reward = torch.zeros(self.num_envs, dtype=torch.float, device=self.device)
    if "success" in info:
        reward += info["success"].to(torch.float)
    if "fail" in info:
        reward -= info["fail"].to(torch.float)
    return reward

I/O Contract

evaluate()

Parameters: none. evaluate() reads internal simulation state directly from self.obj, self.agent, etc.

Returns: dict with the following keys:

  • "success" (torch.Tensor, bool, shape (num_envs,)), recommended: True for environments that have achieved the goal.
  • "fail" (torch.Tensor, bool, shape (num_envs,)), optional: True for environments in an irrecoverable failure state.
  • Custom keys (torch.Tensor), optional: any intermediate computations (distances, grasp state, etc.) worth reusing in observations or rewards.

Termination logic: The step() method uses the returned dict to compute terminated = success | fail. If only "success" is present, terminated = success. If neither is present, terminated is all False.
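That rule can be expressed as a small torch-free helper. The function below is a hypothetical sketch of the logic described above; the real computation happens inside step().

```python
def compute_terminated(info: dict, num_envs: int) -> list:
    """Sketch of terminated = success | fail over per-env boolean lists."""
    false_mask = [False] * num_envs
    success = info.get("success", false_mask)
    fail = info.get("fail", false_mask)
    # Elementwise OR: an env terminates on either success or failure.
    return [s or f for s, f in zip(success, fail)]
```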

compute_dense_reward()

Parameters:

  • obs (Any): raw observation dictionary (no flattening or wrappers applied).
  • action (torch.Tensor): the most recent action taken, shape (num_envs, action_dim).
  • info (dict): the info dictionary returned by evaluate().

Returns: torch.Tensor of shape (num_envs,) with dtype=torch.float.

Usage Examples

Simple Success Evaluation

def evaluate(self):
    # Success if cube is within goal radius and on the table
    obj_to_goal = torch.linalg.norm(
        self.obj.pose.p[..., :2] - self.goal_region.pose.p[..., :2], axis=1
    )
    is_obj_placed = (obj_to_goal < self.goal_radius) & (
        self.obj.pose.p[..., 2] < self.cube_half_size + 5e-3
    )
    return {"success": is_obj_placed}

Evaluation With Intermediate Data

def evaluate(self):
    obj_to_goal_dist = torch.linalg.norm(
        self.obj.pose.p - self.goal_pos, axis=1
    )
    is_grasped = self.agent.is_grasping(self.obj)
    success = (obj_to_goal_dist < 0.05) & is_grasped
    return {
        "success": success,
        "obj_to_goal_dist": obj_to_goal_dist,
        "is_grasped": is_grasped,
    }

Multi-Stage Dense Reward (PushCube Pattern)

def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    # Stage 1: Reaching reward -- move TCP to push position behind cube
    tcp_push_pose = Pose.create_from_pq(
        p=self.obj.pose.p
        + torch.tensor([-self.cube_half_size - 0.005, 0, 0], device=self.device)
    )
    tcp_to_push_dist = torch.linalg.norm(
        tcp_push_pose.p - self.agent.tcp.pose.p, axis=1
    )
    reaching_reward = 1 - torch.tanh(5 * tcp_to_push_dist)
    reward = reaching_reward

    # Stage 2: Placement reward -- move cube toward goal (activated after reaching)
    reached = tcp_to_push_dist < 0.01
    obj_to_goal_dist = torch.linalg.norm(
        self.obj.pose.p[..., :2] - self.goal_region.pose.p[..., :2], axis=1
    )
    place_reward = 1 - torch.tanh(5 * obj_to_goal_dist)
    reward += place_reward * reached

    # Stage 3: Height maintenance -- keep cube on table surface
    z_deviation = torch.abs(self.obj.pose.p[..., 2] - self.cube_half_size)
    z_reward = 1 - torch.tanh(5 * z_deviation)
    reward += place_reward * z_reward * reached

    # Override with max reward for successful environments
    reward[info["success"]] = 4
    return reward

Normalized Dense Reward

def compute_normalized_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    max_reward = 4.0
    return self.compute_dense_reward(obs=obs, action=action, info=info) / max_reward

Evaluation With Success and Failure

def evaluate(self):
    obj_pos = self.obj.pose.p
    # Success: object is at goal
    success = torch.linalg.norm(obj_pos - self.goal_pos, axis=1) < 0.05
    # Failure: object fell off the table
    fail = obj_pos[..., 2] < -0.1
    return {"success": success, "fail": fail}

Dense Reward With Action Penalty

def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    # Distance-based reward
    obj_to_goal = torch.linalg.norm(
        self.obj.pose.p - self.goal_pos, axis=1
    )
    distance_reward = 1 - torch.tanh(3 * obj_to_goal)

    # Action regularization penalty
    action_penalty = 0.01 * torch.linalg.norm(action, axis=1)

    reward = distance_reward - action_penalty
    reward[info["success"]] = 2.0
    return reward

Related Pages

  • Principle:Haosulab_ManiSkill_Reward_Success_Design