Implementation:Haosulab ManiSkill Evaluate Dense Reward
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | ManiSkill evaluate() and compute_dense_reward() |
| Domain | Simulation, Robotics, Environment_Design, Reinforcement_Learning |
| Related Principle | Principle:Haosulab_ManiSkill_Reward_Success_Design |
| Source File | mani_skill/envs/sapien_env.py (L1134-1144 evaluate, L698-720 compute_dense_reward) |
| Date | 2026-02-15 |
| Repository | Haosulab/ManiSkill |
Overview
Description
This document describes the concrete interfaces for task evaluation and reward computation in ManiSkill:
- evaluate(): Returns a dictionary containing at minimum a "success" boolean tensor, optionally a "fail" boolean tensor, and any additional intermediate data useful for observations and rewards.
- compute_dense_reward(): Returns a float tensor of shape (num_envs,) representing the dense reward for the current step.
- compute_normalized_dense_reward(): Returns the dense reward normalized to the [0, 1] range.
- compute_sparse_reward(): Default implementation returns +1 for success, -1 for failure, 0 otherwise. Can be overridden.
Usage
Override these methods in your BaseEnv subclass. The reward mode is selected at environment creation and determines which method is called during step().
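A minimal sketch of selecting the reward mode at environment creation (PushCube-v1 and num_envs=4 are illustrative choices; multiple parallel environments assume a GPU-simulation-capable setup):
import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill environments with gymnasium

# reward_mode selects which compute_*_reward() method step() calls,
# e.g. "dense", "normalized_dense", or "sparse".
env = gym.make("PushCube-v1", num_envs=4, reward_mode="normalized_dense")
obs, _ = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(reward.shape)  # torch.Size([4]) -- one reward per parallel environment
env.close()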
Code Reference
evaluate() (sapien_env.py L1134-1144)
def evaluate(self) -> dict:
"""
Evaluate whether the environment is currently in a success state
by returning a dictionary with a "success" key or a failure state
via a "fail" key.
This function may also return additional data that has been computed
(e.g. is the robot grasping some object) that may be reused when
generating observations and rewards.
By default if not overridden this function returns an empty dictionary.
"""
return dict()
compute_dense_reward() (sapien_env.py L698-707)
def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
"""
Compute the dense reward.
Args:
obs (Any): The observation data. By default the observation data
will be in its most raw form, a dictionary (no flattening,
wrappers etc.)
action (torch.Tensor): The most recent action.
info (dict): The info dictionary (output of evaluate()).
Returns:
torch.Tensor: Reward tensor of shape (num_envs,).
Raises:
NotImplementedError if not overridden.
"""
raise NotImplementedError()
compute_normalized_dense_reward() (sapien_env.py L709-720)
def compute_normalized_dense_reward(
self, obs: Any, action: torch.Tensor, info: dict
):
"""
Compute the normalized dense reward (expected range [0, 1]).
Args:
obs (Any): The observation data.
action (torch.Tensor): The most recent action.
info (dict): The info dictionary.
Returns:
torch.Tensor: Normalized reward tensor of shape (num_envs,).
Raises:
NotImplementedError if not overridden.
"""
raise NotImplementedError()
compute_sparse_reward() (sapien_env.py L672-696)
def compute_sparse_reward(self, obs: Any, action: torch.Tensor, info: dict):
"""
Default sparse reward: +1 for success, -1 for fail, 0 otherwise.
Uses info["success"] and info["fail"] if present.
Returns:
torch.Tensor: Sparse reward of shape (num_envs,).
"""
I/O Contract
evaluate()
| Parameter | Type | Description |
|---|---|---|
| (none) | -- | Reads internal state directly from self.obj, self.agent, etc. |
Returns: dict with the following keys:
| Key | Type | Required | Description |
|---|---|---|---|
"success" |
torch.Tensor (bool, shape (num_envs,)) |
Recommended | True for environments that have achieved the goal
|
"fail" |
torch.Tensor (bool, shape (num_envs,)) |
Optional | True for environments in an irrecoverable failure state
|
| (custom keys) | torch.Tensor |
Optional | Any intermediate computations (distances, grasp state, etc.) |
Termination logic: The step() method uses the returned dict to compute terminated = success | fail. If only "success" is present, terminated = success. If neither is present, terminated is all False.
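A compact sketch of that logic, where info is the dict returned by evaluate() (illustrative, not the library's exact step() code):
success = info.get("success", torch.zeros(self.num_envs, dtype=torch.bool, device=self.device))
fail = info.get("fail", torch.zeros(self.num_envs, dtype=torch.bool, device=self.device))
terminated = success | fail  # bool tensor of shape (num_envs,)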
compute_dense_reward()
| Parameter | Type | Description |
|---|---|---|
| obs | Any | Raw observation dictionary (unflattened) |
| action | torch.Tensor | The most recent action taken, shape (num_envs, action_dim) |
| info | dict | The info dictionary from evaluate() |
Returns: torch.Tensor of shape (num_envs,) with dtype=torch.float.
Usage Examples
Simple Success Evaluation
def evaluate(self):
# Success if cube is within goal radius and on the table
obj_to_goal = torch.linalg.norm(
self.obj.pose.p[..., :2] - self.goal_region.pose.p[..., :2], axis=1
)
is_obj_placed = (obj_to_goal < self.goal_radius) & (
self.obj.pose.p[..., 2] < self.cube_half_size + 5e-3
)
return {"success": is_obj_placed}
Evaluation With Intermediate Data
def evaluate(self):
obj_to_goal_dist = torch.linalg.norm(
self.obj.pose.p - self.goal_pos, axis=1
)
is_grasped = self.agent.is_grasping(self.obj)
success = (obj_to_goal_dist < 0.05) & is_grasped
return {
"success": success,
"obj_to_goal_dist": obj_to_goal_dist,
"is_grasped": is_grasped,
}
Multi-Stage Dense Reward (PushCube Pattern)
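(This excerpt assumes Pose is imported, e.g. from mani_skill.utils.structs.pose import Pose in ManiSkill 3.)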
def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
# Stage 1: Reaching reward -- move TCP to push position behind cube
tcp_push_pose = Pose.create_from_pq(
p=self.obj.pose.p
+ torch.tensor([-self.cube_half_size - 0.005, 0, 0], device=self.device)
)
tcp_to_push_dist = torch.linalg.norm(
tcp_push_pose.p - self.agent.tcp.pose.p, axis=1
)
reaching_reward = 1 - torch.tanh(5 * tcp_to_push_dist)
reward = reaching_reward
# Stage 2: Placement reward -- move cube toward goal (activated after reaching)
reached = tcp_to_push_dist < 0.01
obj_to_goal_dist = torch.linalg.norm(
self.obj.pose.p[..., :2] - self.goal_region.pose.p[..., :2], axis=1
)
place_reward = 1 - torch.tanh(5 * obj_to_goal_dist)
reward += place_reward * reached
# Stage 3: Height maintenance -- keep cube on table surface
z_deviation = torch.abs(self.obj.pose.p[..., 2] - self.cube_half_size)
z_reward = 1 - torch.tanh(5 * z_deviation)
reward += place_reward * z_reward * reached
# Override with max reward for successful environments
reward[info["success"]] = 4
return reward
Normalized Dense Reward
def compute_normalized_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
max_reward = 4.0
return self.compute_dense_reward(obs=obs, action=action, info=info) / max_reward
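The divisor should match the maximum value the dense reward can take (here the success override of 4) so that the normalized reward stays within [0, 1].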
Evaluation With Success and Failure
def evaluate(self):
obj_pos = self.obj.pose.p
# Success: object is at goal
success = torch.linalg.norm(obj_pos - self.goal_pos, axis=1) < 0.05
# Failure: object fell off the table
fail = obj_pos[..., 2] < -0.1
return {"success": success, "fail": fail}
Dense Reward With Action Penalty
def compute_dense_reward(self, obs: Any, action: torch.Tensor, info: dict):
# Distance-based reward
obj_to_goal = torch.linalg.norm(
self.obj.pose.p - self.goal_pos, axis=1
)
distance_reward = 1 - torch.tanh(3 * obj_to_goal)
# Action regularization penalty
action_penalty = 0.01 * torch.linalg.norm(action, axis=1)
reward = distance_reward - action_penalty
reward[info["success"]] = 2.0
return reward
Related Pages
- Principle:Haosulab_ManiSkill_Reward_Success_Design -- The principle this implements
- Implementation:Haosulab_ManiSkill_Get_Obs_Extra_CameraConfig -- Observations that consume evaluate() output
- Implementation:Haosulab_ManiSkill_Initialize_Episode_Pattern -- Initialization that sets up the evaluation context
- Implementation:Haosulab_ManiSkill_Demo_Random_Action_CLI -- Testing reward values with random actions
- Heuristic:Haosulab_ManiSkill_Physics_Solver_Tuning