Principle:Haosulab ManiSkill Reward Success Design

From Leeroopedia
Page Type: Principle
Title: ManiSkill Reward and Success Design
Domain: Simulation, Robotics, Environment_Design, Reinforcement_Learning
Related Implementation: Implementation:Haosulab_ManiSkill_Evaluate_Dense_Reward
Date: 2026-02-15
Repository: Haosulab/ManiSkill

Overview

Description

Reward and success design in ManiSkill defines how the environment communicates task progress and completion to the learning agent. This is split into two complementary mechanisms:

  • Success/failure evaluation (evaluate()): A binary assessment of whether the task is completed successfully or has reached a failure state. This produces a dictionary with "success" and optionally "fail" boolean tensors, along with any intermediate computed data that can be reused by the observation and reward functions.
  • Reward computation (compute_dense_reward(), compute_normalized_dense_reward(), compute_sparse_reward()): Scalar reward signals that guide the learning agent. ManiSkill supports four reward modes:
    • sparse: Discrete reward derived directly from success/failure (default: +1 for success, -1 for failure, 0 otherwise).
    • dense: Continuous reward function designed to provide gradient-like guidance toward task completion.
    • normalized_dense: Dense reward scaled to the [0, 1] range for easier hyperparameter tuning.
    • none: Zero reward (useful for imitation learning or model-based methods).

The reward mode is selected at environment creation time via the reward_mode parameter and determines which reward computation method is called during env.step().
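The mode-selection mechanism can be sketched in plain Python. This toy class mirrors the idea of choosing a reward computation at construction time and dispatching to it during stepping; TinyEnv, get_reward, and the distance-based dense reward are illustrative assumptions, not the actual ManiSkill implementation.

```python
class TinyEnv:
    """Toy environment sketching reward_mode dispatch (not ManiSkill code)."""

    REWARD_MODES = ("sparse", "dense", "normalized_dense", "none")
    MAX_DENSE = 1.0  # maximum of the toy dense reward below

    def __init__(self, reward_mode="dense"):
        # The mode is fixed at creation time, as with the reward_mode parameter.
        if reward_mode not in self.REWARD_MODES:
            raise ValueError(f"unknown reward_mode: {reward_mode}")
        self.reward_mode = reward_mode

    def get_reward(self, dist, info):
        # Called once per step; dispatches on the chosen mode.
        if self.reward_mode == "sparse":
            # +1 success, -1 failure, 0 otherwise.
            return 1.0 if info.get("success") else (-1.0 if info.get("fail") else 0.0)
        if self.reward_mode == "dense":
            # Continuous guidance: closer to the goal means higher reward.
            return 1.0 - min(dist, 1.0)
        if self.reward_mode == "normalized_dense":
            return (1.0 - min(dist, 1.0)) / self.MAX_DENSE
        return 0.0  # "none": useful for imitation learning

sparse_r = TinyEnv("sparse").get_reward(dist=0.3, info={"success": False})
dense_r = TinyEnv("dense").get_reward(dist=0.3, info={"success": False})
```

Here the sparse mode ignores the distance entirely (reward 0.0 until success), while the dense mode already returns 0.7, illustrating why dense rewards give the learner something to climb before the first success.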

The evaluate() function serves as the central computation hub: it is called on every step and every reset, and its output flows into the info dictionary, the observation pipeline (via _get_obs_extra(info)), and the reward computation (via the info parameter). This design avoids redundant computation of expensive quantities like distance measurements, collision checks, or grasp detection.
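The hub pattern above can be sketched in plain Python (floats stand in for torch tensors; TinyReachTask, its fields, and the 0.05 threshold are illustrative assumptions, not ManiSkill API):

```python
class TinyReachTask:
    """Toy single-env task: move an end-effector point to a goal (1-D)."""

    def __init__(self, tcp, goal, threshold=0.05):
        self.tcp = tcp            # end-effector position
        self.goal = goal          # goal position
        self.threshold = threshold

    def evaluate(self):
        # Compute expensive quantities ONCE; return them alongside the
        # success flag so observations and rewards can reuse them.
        dist = abs(self.tcp - self.goal)
        return {"success": dist < self.threshold, "tcp_to_goal_dist": dist}

    def compute_dense_reward(self, info):
        # Reuse the distance from evaluate() instead of recomputing it.
        return 1.0 - min(info["tcp_to_goal_dist"], 1.0)

task = TinyReachTask(tcp=0.2, goal=0.22)
info = task.evaluate()
reward = task.compute_dense_reward(info)
```

The key point is the data flow: evaluate() produces both the success flag and the intermediate distance, and the reward function consumes the info dictionary rather than measuring the scene again.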

Usage

Reward and success design is typically implemented after scene loading and episode initialization, as the developer needs to know what objects exist and what their goal configurations should be. The developer:

  1. Overrides evaluate() to compute and return success/failure conditions and any useful intermediate data.
  2. Overrides compute_dense_reward() to implement a shaped reward function.
  3. Overrides compute_normalized_dense_reward() to provide a [0, 1]-normalized version.
  4. The sparse reward is automatically derived from evaluate() output unless overridden.

All reward functions must return a torch.Tensor of shape (num_envs,) -- one scalar reward per parallel environment.
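The batched convention can be illustrated with plain Python lists standing in for torch tensors of shape (num_envs,); the per-env distances and the 0.05 success threshold are made-up example values:

```python
num_envs = 4
tcp_to_goal = [0.50, 0.10, 0.02, 0.30]   # per-env distances, e.g. from info

def compute_sparse_reward(success):
    # One scalar per parallel environment: +1 on success, 0 otherwise
    # (this toy task has no failure state).
    return [1.0 if s else 0.0 for s in success]

def compute_dense_reward(dists):
    # Same shape contract: one shaped reward per environment.
    return [1.0 - min(d, 1.0) for d in dists]

success = [d < 0.05 for d in tcp_to_goal]
sparse = compute_sparse_reward(success)
dense = compute_dense_reward(tcp_to_goal)
assert len(sparse) == len(dense) == num_envs
```

In the real vectorized setting the list comprehensions become elementwise tensor operations, but the shape contract is the same: every reward function returns one value per environment, never a single pooled scalar.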

Theoretical Basis

The reward design system in ManiSkill is grounded in established reinforcement learning theory and practice:

Reward Shaping (Ng et al., 1999)

Dense reward functions in ManiSkill implement reward shaping -- providing intermediate rewards that guide the agent toward the goal. The key insight from Ng et al. (1999) is that potential-based reward shaping preserves the optimal policy of the original MDP. While ManiSkill's reward functions are not strictly potential-based, they follow the same philosophy: provide continuous feedback proportional to progress toward the goal.

A common pattern in ManiSkill tasks is multi-stage reward:

  1. Reaching reward: Reward for the robot's end-effector approaching the target object.
  2. Grasping reward: Reward for establishing a stable grasp.
  3. Placement reward: Reward for moving the object toward the goal.

Each stage is activated by a condition mask (e.g., the placement reward is only added once the robot has reached the object), creating a natural curriculum within the reward function.
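The staged pattern above can be sketched as a single function with a condition gate; all thresholds, weights, and the grasp flag are illustrative assumptions, not ManiSkill defaults:

```python
def staged_reward(tcp_to_obj, is_grasped, obj_to_goal):
    # Stage 1: always reward the end-effector approaching the object.
    reaching = 1.0 - min(tcp_to_obj, 1.0)
    reward = reaching
    # Stage 2: flat bonus once a stable grasp is detected.
    if is_grasped:
        reward += 1.0
        # Stage 3: placement reward is gated on the grasp condition,
        # creating a curriculum inside the reward function.
        placing = 1.0 - min(obj_to_goal, 1.0)
        reward += placing
    return reward

far = staged_reward(tcp_to_obj=0.8, is_grasped=False, obj_to_goal=0.5)
near = staged_reward(tcp_to_obj=0.05, is_grasped=True, obj_to_goal=0.1)
```

Because the placement term only activates after grasping, the agent cannot collect placement reward by shoving the object around, and each stage's reward ceiling exceeds the previous stage's, so progressing through the stages is always worth more than optimizing an earlier one. In a vectorized implementation the `if` becomes a per-environment boolean mask.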

Sparse vs Dense Rewards

Sparse rewards (success/failure only) define the true task objective but provide no gradient information for learning. Dense rewards provide guidance but may introduce reward hacking if poorly designed. ManiSkill's multi-mode system allows researchers to:

  • Train with dense rewards for faster learning.
  • Evaluate with sparse rewards for unbiased task success measurement.
  • Use normalized dense rewards for consistent hyperparameter settings across tasks.

Reward Normalization

The compute_normalized_dense_reward() method is expected to return values in the [0, 1] range, typically by dividing the dense reward by its maximum possible value. Normalization facilitates:

  • Consistent learning rate tuning across different tasks.
  • Fair comparison of learning curves across tasks with different reward scales.
  • Easier composition of multi-task learning objectives.
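The divide-by-maximum pattern can be sketched as follows; the toy dense reward and its maximum of 2.0 are illustrative assumptions:

```python
MAX_DENSE_REWARD = 2.0  # maximum value the toy dense reward below can return

def compute_dense_reward(dist, success):
    # Distance-based shaping, saturating at the maximum on success.
    if success:
        return MAX_DENSE_REWARD
    return 1.0 - min(dist, 1.0)

def compute_normalized_dense_reward(dist, success):
    # Dividing by the known maximum maps the reward into [0, 1].
    return compute_dense_reward(dist, success) / MAX_DENSE_REWARD

r_partial = compute_normalized_dense_reward(dist=0.5, success=False)
r_success = compute_normalized_dense_reward(dist=0.0, success=True)
```

The division only yields a [0, 1] range because the dense reward here is bounded below by 0 and above by a known constant; a dense reward with unbounded or negative terms would need those bounds established first.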

Separation of Evaluation and Reward

The evaluate() and compute_dense_reward() separation follows the single responsibility principle:

  • evaluate() answers: Is the task done? How?
  • compute_dense_reward() answers: How much progress was made?

This separation is important because evaluation criteria should remain stable (they define the task), while reward functions may be redesigned to improve learning without changing what counts as success.

Termination Semantics

ManiSkill uses the Gymnasium convention of distinguishing termination (task success or failure) from truncation (time limit reached):

  • If evaluate() returns "success": True, the episode terminates with a positive outcome.
  • If evaluate() returns "fail": True, the episode terminates with a negative outcome.
  • If max_episode_steps is reached (handled by TimeLimitWrapper), the episode is truncated.

This distinction is critical for correct value function bootstrapping in RL algorithms.
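The three cases above can be sketched as a small decision function; `max_episode_steps` and the info dictionaries are illustrative, and the logic follows the Gymnasium convention rather than ManiSkill's actual wrapper code:

```python
max_episode_steps = 3  # illustrative time limit

def step_outcome(info, elapsed_steps):
    # Termination: the task itself ended, in success or failure.
    terminated = bool(info.get("success")) or bool(info.get("fail"))
    # Truncation: the time limit cut the episode short.
    truncated = (not terminated) and elapsed_steps >= max_episode_steps
    return terminated, truncated

assert step_outcome({"success": True}, 1) == (True, False)
assert step_outcome({"fail": True}, 1) == (True, False)
assert step_outcome({}, 3) == (False, True)   # time limit only
assert step_outcome({}, 1) == (False, False)  # episode continues
```

For value bootstrapping, the distinction matters because a truncated episode could have continued: the algorithm should bootstrap from the value of the final state on truncation, but treat the terminal value as zero (or the terminal reward) on true termination.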
