
Implementation:Haosulab ManiSkill Evaluate Dense Reward

From Leeroopedia
Page Type: Implementation (Pattern Doc)
Title: ManiSkill evaluate() and compute_dense_reward()
Domain: Simulation, Robotics, Environment_Design, Reinforcement_Learning
Related Principle: Principle:Haosulab_ManiSkill_Reward_Success_Design
Source File: mani_skill/envs/sapien_env.py (L1134-1144 evaluate, L698-720 compute_dense_reward)
Date: 2026-02-15
Repository: Haosulab/ManiSkill

Overview

Description

This document describes the concrete interfaces for task evaluation and reward computation in ManiSkill:

  • evaluate(): Returns a dictionary containing at minimum a "success" boolean tensor, optionally a "fail" boolean tensor, and any additional intermediate data useful for observations and rewards.
  • compute_dense_reward(): Returns a float tensor of shape (num_envs,) representing the dense reward for the current step.
  • compute_normalized_dense_reward(): Returns the dense reward normalized to the [0, 1] range.
  • compute_sparse_reward(): Default implementation returns +1 for success, -1 for failure, 0 otherwise. Can be overridden.
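Taken together, these overrides fit into a subclass as sketched below. The class name is hypothetical and stands in for a real BaseEnv subclass; the only assumptions are that the environment exposes num_envs and device, as BaseEnv does.

```python
import torch

class MinimalTaskSketch:
    """Hypothetical stand-in for a BaseEnv subclass; illustrative only."""

    num_envs = 4
    device = "cpu"

    def evaluate(self) -> dict:
        # A real task would inspect simulation state here; a fixed
        # per-env success mask is hard-coded for illustration.
        success = torch.tensor([True, False, False, True], device=self.device)
        return {"success": success}

    def compute_dense_reward(self, obs, action, info):
        # Shaped reward, overridden to the maximum (1.0) on success.
        reward = torch.full((self.num_envs,), 0.25, device=self.device)
        reward[info["success"]] = 1.0
        return reward

    def compute_normalized_dense_reward(self, obs, action, info):
        # Divide by the maximum achievable dense reward (1.0 here).
        return self.compute_dense_reward(obs, action, info) / 1.0
```

Note how compute_dense_reward consumes the dict returned by evaluate() via the info argument, and the normalized variant simply rescales the dense reward by its maximum.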

Usage

Override these methods in your BaseEnv subclass. The reward mode is selected at environment creation and determines which method is called during step().
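The selection can be pictured with the following simplified, torch-free sketch. The class and stub rewards are hypothetical; the real dispatch lives inside BaseEnv's step/reward logic in sapien_env.py.

```python
class RewardModeDispatchSketch:
    """Simplified sketch of reward-mode dispatch (not the real BaseEnv)."""

    def __init__(self, reward_mode: str = "dense"):
        self.reward_mode = reward_mode

    # Stub rewards standing in for the overridable methods.
    def compute_sparse_reward(self, obs, action, info):
        return 1.0 if info.get("success") else 0.0

    def compute_dense_reward(self, obs, action, info):
        return 0.5

    def compute_normalized_dense_reward(self, obs, action, info):
        # Assume a maximum dense reward of 4.0 for this sketch.
        return self.compute_dense_reward(obs, action, info) / 4.0

    def get_reward(self, obs, action, info):
        # Called once per step(); routes to the method matching reward_mode.
        if self.reward_mode == "sparse":
            return self.compute_sparse_reward(obs, action, info)
        if self.reward_mode == "dense":
            return self.compute_dense_reward(obs, action, info)
        if self.reward_mode == "normalized_dense":
            return self.compute_normalized_dense_reward(obs, action, info)
        raise ValueError(f"Unsupported reward mode: {self.reward_mode}")
```

For example, constructing the sketch with reward_mode="sparse" routes every step's reward through compute_sparse_reward, mirroring how the reward mode chosen at environment creation fixes the code path for the whole episode.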

Code Reference

evaluate() (sapien_env.py L1134-1144)

def evaluate(self) -> dict:
    """
    Evaluate whether the environment is currently in a success state
    by returning a dictionary with a "success" key or a failure state
    via a "fail" key.

    This function may also return additional data that has been computed
    (e.g. is the robot grasping some object) that may be reused when
    generating observations and rewards.

    By default if not overridden this function returns an empty dictionary.
    """
    return dict()

compute_dense_reward() (sapien_env.py L698-707)

def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    """
    Compute the dense reward.

    Args:
        obs (Any): The observation data. By default the observation data
            will be in its most raw form, a dictionary (no flattening,
            wrappers etc.)
        action (torch.Tensor): The most recent action.
        info (dict): The info dictionary (output of evaluate()).

    Returns:
        torch.Tensor: Reward tensor of shape (num_envs,).

    Raises:
        NotImplementedError if not overridden.
    """
    raise NotImplementedError()

compute_normalized_dense_reward() (sapien_env.py L709-720)

def compute_normalized_dense_reward(
    self, obs: Any, action: torch.Tensor, info: dict
):
    """
    Compute the normalized dense reward (expected range [0, 1]).

    Args:
        obs (Any): The observation data.
        action (torch.Tensor): The most recent action.
        info (dict): The info dictionary.

    Returns:
        torch.Tensor: Normalized reward tensor of shape (num_envs,).

    Raises:
        NotImplementedError if not overridden.
    """
    raise NotImplementedError()

compute_sparse_reward() (sapien_env.py L672-696)

def compute_sparse_reward(self, obs: Any, action: torch.Tensor, info: dict):
    """
    Default sparse reward: +1 for success, -1 for fail, 0 otherwise.
    Uses info["success"] and info["fail"] if present.

    Returns:
        torch.Tensor: Sparse reward of shape (num_envs,).
    """
    # Body sketched from the documented behavior; see sapien_env.py
    # L672-696 for the exact implementation.
    reward = torch.zeros(self.num_envs, dtype=torch.float, device=self.device)
    if "success" in info:
        reward += info["success"].to(torch.float)
    if "fail" in info:
        reward -= info["fail"].to(torch.float)
    return reward

I/O Contract

evaluate()

Parameters: none. evaluate() reads internal simulation state directly from self.obj, self.agent, etc.

Returns: dict with the following keys:

  • "success" (torch.Tensor, bool, shape (num_envs,)), recommended: True for environments that have achieved the goal.
  • "fail" (torch.Tensor, bool, shape (num_envs,)), optional: True for environments in an irrecoverable failure state.
  • Custom keys (torch.Tensor), optional: any intermediate computations (distances, grasp state, etc.) worth reusing in observations or rewards.

Termination logic: The step() method uses the returned dict to compute terminated = success | fail. If only "success" is present, terminated = success. If neither is present, terminated is all False.
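That rule can be expressed as a small torch-free helper. The function below is a hypothetical sketch of the logic described above; the real computation happens inside step().

```python
def compute_terminated(info: dict, num_envs: int) -> list:
    """Sketch of terminated = success | fail over per-env boolean lists."""
    false_mask = [False] * num_envs
    success = info.get("success", false_mask)
    fail = info.get("fail", false_mask)
    # Elementwise OR: an env terminates on either success or failure.
    return [s or f for s, f in zip(success, fail)]
```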

compute_dense_reward()

Parameters:

  • obs (Any): raw observation dictionary (no flattening or wrappers applied).
  • action (torch.Tensor): the most recent action taken, shape (num_envs, action_dim).
  • info (dict): the info dictionary returned by evaluate().

Returns: torch.Tensor of shape (num_envs,) with dtype=torch.float.

Usage Examples

Simple Success Evaluation

def evaluate(self):
    # Success if cube is within goal radius and on the table
    obj_to_goal = torch.linalg.norm(
        self.obj.pose.p[..., :2] - self.goal_region.pose.p[..., :2], axis=1
    )
    is_obj_placed = (obj_to_goal < self.goal_radius) & (
        self.obj.pose.p[..., 2] < self.cube_half_size + 5e-3
    )
    return {"success": is_obj_placed}

Evaluation With Intermediate Data

def evaluate(self):
    obj_to_goal_dist = torch.linalg.norm(
        self.obj.pose.p - self.goal_pos, axis=1
    )
    is_grasped = self.agent.is_grasping(self.obj)
    success = (obj_to_goal_dist < 0.05) & is_grasped
    return {
        "success": success,
        "obj_to_goal_dist": obj_to_goal_dist,
        "is_grasped": is_grasped,
    }

Multi-Stage Dense Reward (PushCube Pattern)

def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    # Stage 1: Reaching reward -- move TCP to push position behind cube
    tcp_push_pose = Pose.create_from_pq(
        p=self.obj.pose.p
        + torch.tensor([-self.cube_half_size - 0.005, 0, 0], device=self.device)
    )
    tcp_to_push_dist = torch.linalg.norm(
        tcp_push_pose.p - self.agent.tcp.pose.p, axis=1
    )
    reaching_reward = 1 - torch.tanh(5 * tcp_to_push_dist)
    reward = reaching_reward

    # Stage 2: Placement reward -- move cube toward goal (activated after reaching)
    reached = tcp_to_push_dist < 0.01
    obj_to_goal_dist = torch.linalg.norm(
        self.obj.pose.p[..., :2] - self.goal_region.pose.p[..., :2], axis=1
    )
    place_reward = 1 - torch.tanh(5 * obj_to_goal_dist)
    reward += place_reward * reached

    # Stage 3: Height maintenance -- keep cube on table surface
    z_deviation = torch.abs(self.obj.pose.p[..., 2] - self.cube_half_size)
    z_reward = 1 - torch.tanh(5 * z_deviation)
    reward += place_reward * z_reward * reached

    # Override with max reward for successful environments
    reward[info["success"]] = 4
    return reward

Normalized Dense Reward

def compute_normalized_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    max_reward = 4.0
    return self.compute_dense_reward(obs=obs, action=action, info=info) / max_reward

Evaluation With Success and Failure

def evaluate(self):
    obj_pos = self.obj.pose.p
    # Success: object is at goal
    success = torch.linalg.norm(obj_pos - self.goal_pos, axis=1) < 0.05
    # Failure: object fell off the table
    fail = obj_pos[..., 2] < -0.1
    return {"success": success, "fail": fail}

Dense Reward With Action Penalty

def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
    # Distance-based reward
    obj_to_goal = torch.linalg.norm(
        self.obj.pose.p - self.goal_pos, axis=1
    )
    distance_reward = 1 - torch.tanh(3 * obj_to_goal)

    # Action regularization penalty
    action_penalty = 0.01 * torch.linalg.norm(action, axis=1)

    reward = distance_reward - action_penalty
    reward[info["success"]] = 2.0
    return reward

Related Pages

  • Principle:Haosulab_ManiSkill_Reward_Success_Design