# Principle:Haosulab ManiSkill Reward Success Design
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | ManiSkill Reward and Success Design |
| Domain | Simulation, Robotics, Environment_Design, Reinforcement_Learning |
| Related Implementation | Implementation:Haosulab_ManiSkill_Evaluate_Dense_Reward |
| Date | 2026-02-15 |
| Repository | Haosulab/ManiSkill |
## Overview

### Description
Reward and success design in ManiSkill defines how the environment communicates task progress and completion to the learning agent. This is split into two complementary mechanisms:
- Success/failure evaluation (`evaluate()`): A binary assessment of whether the task has been completed successfully or has reached a failure state. It produces a dictionary with `"success"` and optionally `"fail"` boolean tensors, along with any intermediate computed data that can be reused by the observation and reward functions.
- Reward computation (`compute_dense_reward()`, `compute_normalized_dense_reward()`, `compute_sparse_reward()`): Scalar reward signals that guide the learning agent. ManiSkill supports four reward modes:
  - `sparse`: Binary reward derived from success/failure (default: +1 for success, -1 for failure, 0 otherwise).
  - `dense`: Continuous reward function designed to provide gradient-like guidance toward task completion.
  - `normalized_dense`: Dense reward scaled to the [0, 1] range for easier hyperparameter tuning.
  - `none`: Zero reward (useful for imitation learning or model-based methods).
The reward mode is selected at environment creation time via the `reward_mode` parameter and determines which reward computation method is called during `env.step()`.
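As a rough illustration, the dispatch during `env.step()` might look like the following sketch. The `DummyEnv` class and `get_reward` helper are hypothetical stand-ins, not ManiSkill's actual internals; real tasks subclass ManiSkill's `BaseEnv` and operate on batched torch tensors.

```python
class DummyEnv:
    """Hypothetical stand-in for a ManiSkill task environment."""

    def __init__(self, reward_mode="dense"):
        self.reward_mode = reward_mode  # chosen once at environment creation

    def compute_sparse_reward(self, obs, action, info):
        # +1 on success, -1 on failure, 0 otherwise (the default derivation).
        return 1.0 if info.get("success") else -1.0 if info.get("fail") else 0.0

    def compute_dense_reward(self, obs, action, info):
        return 2.5  # placeholder shaped reward

    def compute_normalized_dense_reward(self, obs, action, info):
        return self.compute_dense_reward(obs, action, info) / 3.0


def get_reward(env, obs, action, info):
    """Route to the reward function selected by reward_mode."""
    if env.reward_mode == "sparse":
        return env.compute_sparse_reward(obs, action, info)
    if env.reward_mode == "dense":
        return env.compute_dense_reward(obs, action, info)
    if env.reward_mode == "normalized_dense":
        return env.compute_normalized_dense_reward(obs, action, info)
    if env.reward_mode == "none":
        return 0.0
    raise ValueError(f"unknown reward mode: {env.reward_mode}")
```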
The `evaluate()` function serves as the central computation hub: it is called on every step and every reset, and its output flows into the `info` dictionary, the observation pipeline (via `_get_obs_extra(info)`), and the reward computation (via the `info` parameter). This design avoids redundant computation of expensive quantities like distance measurements, collision checks, or grasp detection.
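A simplified, single-environment sketch of this pattern (all names are illustrative; real ManiSkill code computes batched torch tensors):

```python
import math

def evaluate(tcp_pos, obj_pos, goal_pos, threshold=0.025):
    """Compute success plus intermediate quantities once per step so the
    observation and reward code can reuse them through the info dict."""
    tcp_to_obj_dist = math.dist(tcp_pos, obj_pos)
    obj_to_goal_dist = math.dist(obj_pos, goal_pos)
    return {
        "success": obj_to_goal_dist < threshold,
        "tcp_to_obj_dist": tcp_to_obj_dist,    # reused by a reaching reward
        "obj_to_goal_dist": obj_to_goal_dist,  # reused by a placement reward
    }
```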
### Usage
Reward and success design is typically implemented after scene loading and episode initialization, since the developer needs to know what objects exist and what their goal configurations are. The developer:
- Overrides `evaluate()` to compute and return success/failure conditions and any useful intermediate data.
- Overrides `compute_dense_reward()` to implement a shaped reward function.
- Overrides `compute_normalized_dense_reward()` to provide a [0, 1]-normalized version.
- The sparse reward is automatically derived from `evaluate()` output unless overridden.
All reward functions must return a `torch.Tensor` of shape `(num_envs,)` -- one scalar reward per parallel environment.
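Putting the override pattern together, here is a framework-free sketch. A real task subclasses ManiSkill's `BaseEnv` and every method returns a `torch.Tensor` of shape `(num_envs,)` rather than the plain floats used here; the positions, thresholds, and reward terms below are illustrative assumptions.

```python
import math

class PushCubeSketch:
    # The dense reward below is a sum of two terms each in [0, 1),
    # so its supremum is 2 -- used for normalization.
    MAX_DENSE_REWARD = 2.0

    def __init__(self):
        self.tcp_pos = (0.1, 0.0, 0.1)    # end-effector (tool center point)
        self.obj_pos = (0.0, 0.0, 0.02)   # cube to push
        self.goal_pos = (0.2, 0.2, 0.02)  # target location

    def evaluate(self):
        return {
            "success": math.dist(self.obj_pos, self.goal_pos) < 0.025,
            "tcp_to_obj_dist": math.dist(self.tcp_pos, self.obj_pos),
            "obj_to_goal_dist": math.dist(self.obj_pos, self.goal_pos),
        }

    def compute_dense_reward(self, info):
        # Reuse quantities already computed by evaluate() via info.
        reaching = 1 - math.tanh(5 * info["tcp_to_obj_dist"])
        placing = 1 - math.tanh(5 * info["obj_to_goal_dist"])
        return reaching + placing

    def compute_normalized_dense_reward(self, info):
        return self.compute_dense_reward(info) / self.MAX_DENSE_REWARD
```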
## Theoretical Basis
The reward design system in ManiSkill is grounded in established reinforcement learning theory and practice:
### Reward Shaping (Ng et al., 1999)
Dense reward functions in ManiSkill implement reward shaping -- providing intermediate rewards that guide the agent toward the goal. The key insight from Ng et al. (1999) is that potential-based reward shaping preserves the optimal policy of the original MDP. While ManiSkill's reward functions are not strictly potential-based, they follow the same philosophy: provide continuous feedback proportional to progress toward the goal.
A common pattern in ManiSkill tasks is multi-stage reward:
- Reaching reward: Reward for the robot's end-effector approaching the target object.
- Grasping reward: Reward for establishing a stable grasp.
- Placement reward: Reward for moving the object toward the goal.
Each stage is activated by a condition mask (e.g., the placement reward is only added once the robot has reached the object), creating a natural curriculum within the reward function.
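A scalar sketch of this staged pattern (real implementations apply boolean masks across the batch of environments; the distance scales and stage weights here are assumptions):

```python
import math

def staged_reward(tcp_to_obj_dist, is_grasped, obj_to_goal_dist):
    # Stage 1: reaching -- approaches 1 as the gripper nears the object.
    reward = 1 - math.tanh(5 * tcp_to_obj_dist)
    # Stage 2: grasping -- flat bonus once a stable grasp is detected.
    reward += 1.0 if is_grasped else 0.0
    # Stage 3: placement -- gated on the grasp condition, forming a curriculum.
    if is_grasped:
        reward += 1 - math.tanh(5 * obj_to_goal_dist)
    return reward
```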
### Sparse vs Dense Rewards
Sparse rewards (success/failure only) define the true task objective but provide no gradient information for learning. Dense rewards provide guidance but may introduce reward hacking if poorly designed. ManiSkill's multi-mode system allows researchers to:
- Train with dense rewards for faster learning.
- Evaluate with sparse rewards for unbiased task success measurement.
- Use normalized dense rewards for consistent hyperparameter settings across tasks.
### Reward Normalization
The `compute_normalized_dense_reward()` method is expected to return values in the [0, 1] range. This is computed by dividing the dense reward by the maximum possible reward. Normalization facilitates:
- Consistent learning rate tuning across different tasks.
- Fair comparison of learning curves across tasks with different reward scales.
- Easier composition of multi-task learning objectives.
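For example, if a dense reward is a sum of three stage terms each bounded above by 1, the maximum possible reward is 3 and the normalized variant simply divides by it (a sketch; the constant depends on the particular task's reward design):

```python
def compute_normalized_dense_reward(dense_reward, max_reward=3.0):
    # Dividing by the maximum attainable dense reward maps values into [0, 1].
    return dense_reward / max_reward
```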
### Separation of Evaluation and Reward
The `evaluate()` and `compute_dense_reward()` separation follows the single responsibility principle:
- `evaluate()` answers: Is the task done? How?
- `compute_dense_reward()` answers: How much progress was made?
This separation is important because evaluation criteria should remain stable (they define the task), while reward functions may be redesigned to improve learning without changing what counts as success.
### Termination Semantics
ManiSkill uses the Gymnasium convention of distinguishing termination (task success or failure) from truncation (time limit reached):
- If `evaluate()` returns `"success": True`, the episode terminates with a positive outcome.
- If `evaluate()` returns `"fail": True`, the episode terminates with a negative outcome.
- If `max_episode_steps` is reached (handled by the `TimeLimit` wrapper), the episode is truncated.
This distinction is critical for correct value function bootstrapping in RL algorithms.
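Under assumed variable names, the split can be sketched as:

```python
def episode_status(info, elapsed_steps, max_episode_steps):
    """Gymnasium-style split: terminated ends the MDP (the value function
    must not bootstrap past this state), truncated merely cuts the episode
    short (the value function should still bootstrap from the final state)."""
    terminated = bool(info.get("success", False)) or bool(info.get("fail", False))
    truncated = elapsed_steps >= max_episode_steps
    return terminated, truncated
```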
## Related Pages
- Implementation:Haosulab_ManiSkill_Evaluate_Dense_Reward -- Concrete reward and evaluation patterns
- Principle:Haosulab_ManiSkill_Episode_Initialization -- Initialization determines the evaluation context
- Principle:Haosulab_ManiSkill_Observation_Definition -- Observations and rewards share the info dict
- Principle:Haosulab_ManiSkill_Environment_Testing -- Testing reward correctness