Principle:Haosulab ManiSkill Observation Definition
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | ManiSkill Observation Definition |
| Domain | Simulation, Robotics, Environment_Design, Reinforcement_Learning, Computer_Vision |
| Related Implementation | Implementation:Haosulab_ManiSkill_Get_Obs_Extra_CameraConfig |
| Date | 2026-02-15 |
| Repository | Haosulab/ManiSkill |
Overview
Description
Observation definition in ManiSkill involves two complementary aspects: specifying task-specific state information (extra observations) and configuring sensor hardware (cameras) that capture visual data from the simulation.
ManiSkill supports multiple observation modes that determine what data the environment returns:
- `state` / `state_dict`: Ground-truth state information including robot proprioception (joint positions, velocities) and task-specific extras (object poses, goal positions). The `state` mode returns a flattened tensor; `state_dict` returns a nested dictionary.
- `sensor_data` / `rgb` / `depth` / `rgbd` / `pointcloud`: Visual observation modes that render images from cameras configured in the environment. These modes capture pixel data from one or more cameras and include it in the observation alongside proprioceptive state.
- `none`: No observations are returned.
The observation pipeline is composed of three layers:
- Agent proprioception (`_get_obs_agent()`): Automatically provided by the robot agent. Includes joint positions, velocities, and controller state. Task developers typically do not need to override this.
- Task-specific extras (`_get_obs_extra()`): The primary hook for task developers. Returns a dictionary of `torch.Tensor` values representing task-relevant information such as goal positions, relative poses, grasp state indicators, or any other computed features.
- Sensor data (`_default_sensor_configs`): Camera configurations that define where cameras are placed, their resolution, field of view, and rendering shader. Cameras can be mounted at static locations or attached to robot links (e.g., wrist cameras).
A critical design principle is the distinction between state observations and visual observations. When obs_mode is a state mode, the task may include ground-truth information (like exact object poses) that would not be available in a real-world scenario. When the observation mode is visual, the agent should rely on rendered images to infer object states. The self.obs_mode_struct.use_state flag indicates whether state information should be included, allowing a single _get_obs_extra() implementation to serve both modes.
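The `use_state` branching described above can be sketched as follows. This is a simplified, framework-free illustration: in ManiSkill the values would be batched `torch.Tensor`s, the flag comes from `self.obs_mode_struct.use_state` on the base environment, and the task/field names here (`PickCubeTaskSketch`, `cube_pos`, etc.) are illustrative assumptions.

```python
class PickCubeTaskSketch:
    """Toy stand-in for a ManiSkill task; method names mirror the real API
    but the data are plain Python lists, not torch.Tensors."""

    def __init__(self, use_state: bool):
        # In ManiSkill this flag is self.obs_mode_struct.use_state.
        self.use_state = use_state
        self.tcp_pos = [0.1, 0.0, 0.3]    # end-effector position (measurable on a real robot)
        self.cube_pos = [0.0, 0.2, 0.02]  # ground-truth object position (sim only)
        self.goal_pos = [0.0, -0.2, 0.2]  # task goal position

    def _get_obs_extra(self, info: dict) -> dict:
        # Always include data a real robot could also measure.
        obs = dict(tcp_pos=self.tcp_pos, goal_pos=self.goal_pos)
        if self.use_state:
            # Ground-truth object state: only for state observation modes;
            # in visual modes the agent must infer this from camera images.
            obs["cube_pos"] = self.cube_pos
            obs["is_grasped"] = info.get("is_grasped", False)
        return obs

# State mode exposes ground truth; visual mode omits it.
state_obs = PickCubeTaskSketch(use_state=True)._get_obs_extra({"is_grasped": True})
visual_obs = PickCubeTaskSketch(use_state=False)._get_obs_extra({})
```

The single `_get_obs_extra()` body serves both modes: only the branch guarded by the flag changes, so the task never needs two observation implementations.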
Usage
Observation definition is performed when implementing a custom task by:
- Overriding `_get_obs_extra()` to return task-specific observation tensors.
- Overriding the `_default_sensor_configs` property to configure cameras for visual observation modes.
- Optionally overriding `_default_human_render_camera_configs` for higher-quality cameras used only for human viewing / video recording.
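The camera-configuration step above can be sketched with a simplified stand-in for ManiSkill's camera config class (the real class is `CameraConfig` in `mani_skill.sensors.camera` and also takes a pose; the exact field set and the `panda_hand_tcp` mount link used here are illustrative assumptions, not the verbatim API):

```python
from dataclasses import dataclass
from typing import Optional
import math

@dataclass
class CameraConfigSketch:
    """Simplified stand-in for ManiSkill's CameraConfig (fields illustrative)."""
    uid: str                     # unique camera name
    width: int                   # image width in pixels
    height: int                  # image height in pixels
    fov: float                   # vertical field of view, radians
    near: float = 0.01           # near clipping plane
    far: float = 100.0           # far clipping plane
    mount: Optional[str] = None  # robot link name for mounted (e.g. wrist) cameras

def default_sensor_configs():
    """Mirrors overriding _default_sensor_configs: one static third-person
    camera plus a wrist camera mounted on a gripper link."""
    return [
        CameraConfigSketch("base_camera", 128, 128, fov=math.pi / 2),
        CameraConfigSketch("hand_camera", 128, 128, fov=math.pi / 2,
                           mount="panda_hand_tcp"),  # hypothetical link name
    ]
```

A static camera leaves `mount` unset; a wrist camera names the robot link it rides on, so its extrinsics follow the end-effector automatically.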
The developer uses the `info` dictionary (produced by `evaluate()`) to avoid recomputing expensive data. For example, if `evaluate()` computes whether the robot is grasping an object, this boolean can be passed through `info` and included in observations without redundant computation.
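That reuse pattern can be sketched as follows, again with plain Python in place of batched tensors (the class and the distance-based grasp check are illustrative; real ManiSkill tasks implement `evaluate()` with simulation queries):

```python
class GraspTaskSketch:
    """Toy illustration of the compute-once pattern: evaluate() performs the
    expensive check, and _get_obs_extra() reuses its result via info."""

    def __init__(self):
        self.gripper_pos = [0.0, 0.0, 0.05]
        self.obj_pos = [0.0, 0.0, 0.05]

    def evaluate(self) -> dict:
        # Expensive check performed exactly once per step.
        dist = sum((g - o) ** 2
                   for g, o in zip(self.gripper_pos, self.obj_pos)) ** 0.5
        return {"is_grasped": dist < 0.01, "dist": dist}

    def _get_obs_extra(self, info: dict) -> dict:
        # Reuse the cached result instead of recomputing the distance.
        return {"is_grasped": info["is_grasped"]}

task = GraspTaskSketch()
info = task.evaluate()                  # computed once...
obs_extra = task._get_obs_extra(info)   # ...reused for observations
```

In the real environment the same `info` dictionary also feeds reward computation and termination checks, so the distance is never recomputed within a step.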
Theoretical Basis
Observation design for robot learning draws on several principles:
- Observation design for RL/IL: The choice of observations significantly affects learning performance. Dense state observations (object positions, orientations) enable faster learning in simulation, while visual observations (RGB, depth) are necessary for sim-to-real transfer. ManiSkill's multi-mode observation system allows researchers to develop tasks with state-based observations first and then switch to visual observations for deployment-oriented training.
- Sensor configuration: Cameras in ManiSkill model real-world camera systems with configurable intrinsic parameters (field of view, resolution, near/far planes) and extrinsic parameters (pose relative to the world or a robot link). This enables realistic sensor simulation:
  - Third-person cameras: Static cameras observing the workspace, similar to overhead or side-mounted cameras in real lab setups.
  - Wrist cameras: Cameras mounted on the robot's end-effector link, providing an ego-centric viewpoint that moves with the robot.
  - Stereo depth cameras: For realistic depth estimation simulation.
- State-based vs visual observations: Following the practice in robotics RL research, ManiSkill distinguishes between state information (directly accessible simulation data) and sensor observations (rendered visual data). State observations serve as an upper bound on what the agent could learn with perfect perception, while visual observations test the combined perception-and-control pipeline.
- Observation space autodiscovery: ManiSkill automatically constructs the `observation_space` from the first observation returned during `reset()`. This avoids the error-prone practice of manually specifying observation space bounds and shapes.
- Information reuse through the info dict: The `info` parameter in `_get_obs_extra(info)` is the same dictionary returned by `evaluate()`. This implements the "compute once, use everywhere" pattern -- expensive computations (distance calculations, collision checks) are performed once in `evaluate()` and shared across observation generation, reward computation, and termination checking.
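The observation-space autodiscovery mentioned above can be illustrated with a minimal sketch: recursively walk the first observation from `reset()` and derive a space spec from each leaf. ManiSkill does this with gymnasium spaces over batched `torch.Tensor`s; the tuple-based spec and the example field names here are simplifications for illustration.

```python
def infer_space(obs):
    """Sketch of observation-space autodiscovery: map the first observation
    returned by reset() to a nested spec mirroring its structure."""
    if isinstance(obs, dict):
        return {k: infer_space(v) for k, v in obs.items()}
    # A flat list of floats stands in for a torch.Tensor leaf here.
    return ("Box", (len(obs),))

# Example first observation in a state_dict-like layout (field names illustrative).
first_obs = {
    "agent": {"qpos": [0.0] * 9},       # robot joint positions
    "extra": {"goal_pos": [0.0] * 3},   # task-specific extra from _get_obs_extra
}
space = infer_space(first_obs)
```

Because the space is derived from a real observation, any change to `_get_obs_extra()` is reflected automatically, with no bounds or shapes to update by hand.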
Related Pages
- Implementation:Haosulab_ManiSkill_Get_Obs_Extra_CameraConfig -- Concrete observation and camera configuration APIs
- Principle:Haosulab_ManiSkill_Episode_Initialization -- Initialization determines what is observable
- Principle:Haosulab_ManiSkill_Reward_Success_Design -- Rewards may use the same info as observations
- Principle:Haosulab_ManiSkill_Environment_Testing -- Verifying observations through testing
- Heuristic:Haosulab_ManiSkill_Rendering_Memory_Optimization