
Principle:Haosulab ManiSkill Observation Definition

From Leeroopedia
Page Type: Principle
Title: ManiSkill Observation Definition
Domain: Simulation, Robotics, Environment_Design, Reinforcement_Learning, Computer_Vision
Related Implementation: Implementation:Haosulab_ManiSkill_Get_Obs_Extra_CameraConfig
Date: 2026-02-15
Repository: Haosulab/ManiSkill

Overview

Description

Observation definition in ManiSkill involves two complementary aspects: specifying task-specific state information (extra observations) and configuring sensor hardware (cameras) that capture visual data from the simulation.

ManiSkill supports multiple observation modes that determine what data the environment returns:

  • state / state_dict: Ground-truth state information including robot proprioception (joint positions, velocities) and task-specific extras (object poses, goal positions). The state mode returns a flattened tensor; state_dict returns a nested dictionary.
  • sensor_data / rgb / depth / rgbd / pointcloud: Visual observation modes that render images from cameras configured in the environment. These modes capture pixel data from one or more cameras and include it in the observation alongside proprioceptive state.
  • none: No observations are returned.
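The distinction between the flattened state mode and the nested state_dict mode can be illustrated with a small sketch. This is plain Python with no ManiSkill dependency; the nested dictionary layout and key names are illustrative assumptions, not the library's exact schema.

```python
def flatten_state(obs):
    """Recursively flatten a nested observation dict into one flat list,
    mimicking how a `state` observation concatenates a `state_dict`."""
    flat = []
    for key in sorted(obs):  # deterministic key order for a stable layout
        value = obs[key]
        if isinstance(value, dict):
            flat.extend(flatten_state(value))
        else:
            flat.extend(value)  # leaf: a list of floats
    return flat

# Illustrative nested observation, as `state_dict` mode might return it
state_dict = {
    "agent": {"qpos": [0.0, 0.5], "qvel": [0.1, -0.1]},
    "extra": {"goal_pos": [0.3, 0.2, 0.1]},
}

flat = flatten_state(state_dict)
print(len(flat))  # 7 values: 2 qpos + 2 qvel + 3 goal_pos
```

The nested form preserves semantic grouping for inspection and debugging; the flat form is what a standard MLP policy consumes.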

The observation pipeline is composed of three layers:

  1. Agent proprioception (_get_obs_agent()): Automatically provided by the robot agent. Includes joint positions, velocities, and controller state. Task developers typically do not need to override this.
  2. Task-specific extras (_get_obs_extra()): The primary hook for task developers. Returns a dictionary of torch.Tensor values representing task-relevant information such as goal positions, relative poses, grasp state indicators, or any other computed features.
  3. Sensor data (_default_sensor_configs): Camera configurations that define where cameras are placed, their resolution, field of view, and rendering shader. Cameras can be mounted on static locations or attached to robot links (e.g., wrist cameras).
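The three layers above can be sketched as one assembly step. All function and key names here are illustrative stand-ins for the pipeline's structure, not ManiSkill's internals.

```python
def get_obs_agent():
    # Layer 1: proprioception provided automatically by the robot agent
    return {"qpos": [0.0, 0.5, -0.2], "qvel": [0.0, 0.0, 0.0]}

def get_obs_extra(info):
    # Layer 2: task-specific extras (the developer's hook)
    return {"goal_pos": [0.3, 0.2, 0.1],
            "is_grasped": [float(info["is_grasped"])]}

def get_sensor_data():
    # Layer 3: rendered camera output (shape only, as a placeholder)
    return {"base_camera": {"rgb_shape": (128, 128, 3)}}

def assemble_observation(obs_mode, info):
    obs = {"agent": get_obs_agent(), "extra": get_obs_extra(info)}
    if obs_mode in ("sensor_data", "rgb", "depth", "rgbd", "pointcloud"):
        obs["sensor_data"] = get_sensor_data()
    return obs

obs = assemble_observation("rgb", {"is_grasped": True})
print(sorted(obs))  # ['agent', 'extra', 'sensor_data']
```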

A critical design principle is the distinction between state observations and visual observations. When obs_mode is a state mode, the observation may include ground-truth information (such as exact object poses) that would not be available in a real-world deployment. When obs_mode is a visual mode, the agent should rely on rendered images to infer object states. The self.obs_mode_struct.use_state flag indicates whether state information should be included, allowing a single _get_obs_extra() implementation to serve both modes.
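A minimal sketch of this branching pattern, assuming only what the text above states: the ObsModeStruct class below is a stand-in for the flag ManiSkill exposes as self.obs_mode_struct.use_state, and the pose values are plain lists for illustration.

```python
class ObsModeStruct:
    """Stand-in for the struct carrying the use_state flag."""
    def __init__(self, use_state):
        self.use_state = use_state

def get_obs_extra(obs_mode_struct):
    obs = {
        # Always safe to include: goal info the real robot would also know
        "goal_pos": [0.3, 0.2, 0.1],
    }
    if obs_mode_struct.use_state:
        # Ground truth only available in simulation; in visual modes the
        # agent must infer this from camera images instead
        obs["cube_pose"] = [0.25, 0.18, 0.02, 1.0, 0.0, 0.0, 0.0]
    return obs

print(sorted(get_obs_extra(ObsModeStruct(True))))   # ['cube_pose', 'goal_pos']
print(sorted(get_obs_extra(ObsModeStruct(False))))  # ['goal_pos']
```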

Usage

Observation definition is performed when implementing a custom task by:

  1. Overriding _get_obs_extra() to return task-specific observation tensors.
  2. Overriding the _default_sensor_configs property to configure cameras for visual observation modes.
  3. Optionally overriding _default_human_render_camera_configs for higher-quality cameras used only for human viewing / video recording.
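The three override points can be sketched as a task skeleton. BaseEnv and CameraConfig below are simplified stand-ins so the sketch runs without ManiSkill installed; only the method and property names (_get_obs_extra, _default_sensor_configs, _default_human_render_camera_configs) come from the steps above, and all values are placeholders.

```python
class CameraConfig:
    """Simplified stand-in for a camera configuration object."""
    def __init__(self, uid, pose, width, height, fov):
        self.uid, self.pose = uid, pose
        self.width, self.height, self.fov = width, height, fov

class BaseEnv:
    """Stand-in for the task base class."""

class MyPickTask(BaseEnv):
    def _get_obs_extra(self, info):
        # Step 1: task-specific observation values (plain lists here)
        return {"goal_pos": [0.3, 0.2, 0.1]}

    @property
    def _default_sensor_configs(self):
        # Step 2: low-resolution camera used as the agent's sensor
        return [CameraConfig("base_camera", pose=(0.5, 0.0, 0.5),
                             width=128, height=128, fov=1.57)]

    @property
    def _default_human_render_camera_configs(self):
        # Step 3: higher-quality camera used only for human viewing/videos
        return [CameraConfig("render_camera", pose=(0.8, 0.0, 0.6),
                             width=512, height=512, fov=1.0)]

task = MyPickTask()
print([c.uid for c in task._default_sensor_configs])  # ['base_camera']
```

Keeping the human-render camera separate lets videos use high resolution without slowing down the low-resolution rendering the policy actually consumes.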

Task developers can use the info dictionary (produced by evaluate()) to avoid recomputing expensive quantities. For example, if evaluate() determines whether the robot is grasping an object, that boolean can be passed through info and included in the observation without redundant computation.
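This reuse pattern can be sketched as follows. The grasp test is a toy distance threshold, not ManiSkill's actual grasp logic, and the key names are illustrative.

```python
def evaluate(gripper_pos, cube_pos):
    # Expensive check performed exactly once per step
    dist = sum((g - c) ** 2 for g, c in zip(gripper_pos, cube_pos)) ** 0.5
    return {"is_grasped": dist < 0.02, "dist_to_cube": dist}

def get_obs_extra(info):
    # No recomputation: the boolean is passed straight through from info
    return {"is_grasped": [1.0 if info["is_grasped"] else 0.0]}

info = evaluate(gripper_pos=[0.30, 0.20, 0.05], cube_pos=[0.30, 0.20, 0.06])
print(get_obs_extra(info))  # {'is_grasped': [1.0]}
```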

Theoretical Basis

Observation design for robot learning draws on several principles:

  • Observation design for RL/IL: The choice of observations significantly affects learning performance. Dense state observations (object positions, orientations) enable faster learning in simulation, while visual observations (RGB, depth) are necessary for sim-to-real transfer. ManiSkill's multi-mode observation system allows researchers to develop tasks with state-based observations first and then switch to visual observations for deployment-oriented training.
  • Sensor configuration: Cameras in ManiSkill model real-world camera systems with configurable intrinsic parameters (field of view, resolution, near/far planes) and extrinsic parameters (pose relative to the world or a robot link). This enables realistic sensor simulation:
    • Third-person cameras: Static cameras observing the workspace, similar to overhead or side-mounted cameras in real lab setups.
    • Wrist cameras: Cameras mounted on the robot's end-effector link, providing an ego-centric viewpoint that moves with the robot.
    • Stereo depth cameras: For realistic depth estimation simulation.
  • State-based vs visual observations: Following the practice in robotics RL research, ManiSkill distinguishes between state information (directly accessible simulation data) and sensor observations (rendered visual data). State observations serve as an upper bound on what the agent could learn with perfect perception, while visual observations test the combined perception-and-control pipeline.
  • Observation space autodiscovery: ManiSkill automatically constructs the observation_space from the first observation returned during reset(). This avoids the error-prone practice of manually specifying observation space bounds and shapes.
  • Information reuse through the info dict: The info parameter in _get_obs_extra(info) is the same dictionary returned by evaluate(). This implements a "compute once, use everywhere" pattern: expensive computations (distance calculations, collision checks) are performed once in evaluate() and shared across observation generation, reward computation, and termination checking.
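The autodiscovery idea from the list above can be sketched as a walk over the first observation: record a shape for every leaf and mirror the dict nesting. This mirrors the idea only, not ManiSkill's actual implementation, and the leaves are plain lists rather than tensors.

```python
def infer_space(obs):
    """Derive a space description from a sample observation."""
    if isinstance(obs, dict):
        return {k: infer_space(v) for k, v in obs.items()}
    return ("Box", len(obs))  # leaf: a flat list of floats

# First observation returned at reset (illustrative values)
first_obs = {
    "agent": {"qpos": [0.0, 0.5], "qvel": [0.1, -0.1]},
    "extra": {"goal_pos": [0.3, 0.2, 0.1]},
}

space = infer_space(first_obs)
print(space["extra"])  # {'goal_pos': ('Box', 3)}
```

Because the space is derived from real data, it can never disagree with what the environment actually returns, which is the error the manual-specification approach invites.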
