Principle:Haosulab ManiSkill Camera Depth Sensor Abstraction

Knowledge Sources	Haosulab_ManiSkill ManiSkill Docs
Domains	Robotics, Simulation, Perception
Last Updated	2026-02-15 08:00 GMT

Overview

A sensor system provides a unified abstraction for simulated perception devices -- cameras, depth sensors, and other modalities -- that generate observations from the simulated world, enabling vision-based robot learning with configurable fidelity and noise characteristics.

Description

Robots perceive their environment through sensors: RGB cameras, depth cameras, stereo camera pairs, tactile sensors, and proprioceptive encoders. In simulation, these sensors must be faithfully reproduced so that policies trained on simulated sensor data can transfer to real robots. The Sensor System principle defines an abstraction layer for sensor configuration, placement, rendering, and data extraction.

At the foundation is a base sensor class that establishes the interface: each sensor has a unique identifier, a configuration specifying its intrinsic parameters (resolution, field of view, near/far planes), and methods for capturing observations. The base sensor also defines a configuration dataclass that can be serialized and included in robot agent definitions or environment specifications.

On top of this base, specialized sensor implementations handle specific modalities. The stereo depth camera simulates realistic depth sensing by rendering from two virtual cameras separated by a configurable baseline and running a stereo matching algorithm on the resulting image pair to produce depth maps. Compared with direct Z-buffer readout, this models the noise characteristics of real depth sensors (such as Intel RealSense) more accurately. Shader configurations control which rendering passes are executed (RGB, depth, segmentation, normals, albedo), allowing users to trade rendering cost for observation richness.

Sensors can be mounted at fixed positions in the scene (third-person cameras) or attached to robot links (wrist cameras, head cameras). The sensor system handles the bookkeeping of updating camera poses as the robot moves, rendering all cameras across all parallel environments, and packaging the resulting images into the observation dictionary.
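The pose bookkeeping for mounted cameras can be illustrated with a minimal sketch. It assumes poses are 4x4 homogeneous transforms and that a mounted camera's world pose is the link's world pose composed with a fixed local mount offset. The helper names (make_pose, compose, camera_world_pose) are hypothetical and are not part of any ManiSkill API.

```python
import math

def make_pose(yaw, tx, ty, tz):
    """Hypothetical helper: 4x4 transform with a rotation about z and a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def compose(a, b):
    """Matrix product a @ b for 4x4 nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]

def camera_world_pose(mount_pose, link_world_pose=None):
    """Fixed cameras use the mount pose directly; cameras attached to a
    link compose the link's current world pose with the local mount offset."""
    if link_world_pose is None:
        return mount_pose
    return compose(link_world_pose, mount_pose)

# A link at x = 1 m, rotated 90 degrees about z, carrying a camera
# mounted 0.1 m along the link's local x axis.
link_pose = make_pose(math.pi / 2, 1.0, 0.0, 0.0)
mount_pose = make_pose(0.0, 0.1, 0.0, 0.0)
world = camera_world_pose(mount_pose, link_pose)
position = [world[i][3] for i in range(3)]
```

Recomputing camera_world_pose with the latest link pose on every step is what keeps a wrist or head camera aligned with the moving robot in this sketch.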

Usage

This principle applies whenever:

A task requires visual observations (RGB, depth, segmentation, point clouds) instead of, or in addition to, state-based observations.
Realistic depth sensor simulation is needed, with stereo matching noise rather than perfect Z-buffer depth.
Custom rendering passes (segmentation maps, surface normals, albedo) are needed for specific training approaches.
Camera sensors must be attached to moving robot links and update their poses automatically.
Sensor configurations must be specified declaratively as part of robot or environment definitions.

Theoretical Basis

1. Sensor Abstraction: Each sensor is defined by a configuration dataclass and a runtime class. The configuration specifies static parameters (resolution, intrinsic matrix, noise model) and is serializable for inclusion in agent or environment definitions. The runtime class manages the underlying rendering resources and provides a capture() method that returns observation data.
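The split between a serializable configuration and a runtime class might look like the following sketch. All class names here (CameraConfigSketch, SensorSketch, ConstantCameraSketch) are hypothetical illustrations, not the library's actual classes, and the dummy camera returns a constant image instead of rendering anything.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CameraConfigSketch:
    """Hypothetical static configuration; only plain values, so it serializes cleanly."""
    uid: str
    width: int
    height: int
    fov: float          # field of view in radians
    near: float = 0.01  # near clipping plane
    far: float = 100.0  # far clipping plane

class SensorSketch(ABC):
    """Hypothetical runtime class: holds a config and produces observations."""
    def __init__(self, config):
        self.config = config

    @abstractmethod
    def capture(self):
        """Return a dict mapping modality names to observation data."""

class ConstantCameraSketch(SensorSketch):
    """Stand-in implementation that returns an all-zero depth image."""
    def capture(self):
        row = [0.0] * self.config.width
        return {"depth": [list(row) for _ in range(self.config.height)]}

config = CameraConfigSketch(uid="hand_camera", width=4, height=3, fov=1.0)
serialized = json.dumps(asdict(config))                    # config -> JSON text
restored = CameraConfigSketch(**json.loads(serialized))    # JSON text -> config
observation = ConstantCameraSketch(restored).capture()
```

Keeping the configuration free of rendering resources is what allows it to be embedded in an agent or environment definition and reconstructed later.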

2. Camera Model: Virtual cameras use the pinhole camera model characterized by intrinsic parameters: focal lengths (fx, fy), principal point (cx, cy), and image dimensions (width, height). Extrinsic parameters (the camera's pose in world coordinates) are determined by the sensor's mount point. The near and far clipping planes define the range of depths that can be captured.
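As a worked illustration of the pinhole model, the sketch below derives intrinsics from a vertical field of view and projects a camera-frame point to pixel coordinates. It assumes square pixels, a principal point at the image center, and a camera frame with z pointing forward. The function names are hypothetical, and the simulator's own axis conventions may differ.

```python
import math

def intrinsics_from_fov(width, height, fov_y):
    """Return (fx, fy, cx, cy) for a vertical field of view given in radians."""
    fy = (height / 2.0) / math.tan(fov_y / 2.0)
    fx = fy  # assumption: square pixels
    cx, cy = width / 2.0, height / 2.0  # assumption: centered principal point
    return fx, fy, cx, cy

def project(point, intrinsics, near, far):
    """Project a camera-frame point (x, y, z) to pixel coordinates (u, v).
    Returns None when z falls outside the [near, far] clipping range."""
    x, y, z = point
    if not (near <= z <= far):
        return None
    fx, fy, cx, cy = intrinsics
    return (fx * x / z + cx, fy * y / z + cy)

K = intrinsics_from_fov(640, 480, math.pi / 2)
center = project((0.0, 0.0, 1.0), K, near=0.01, far=100.0)    # on the optical axis
offset = project((1.0, 0.0, 2.0), K, near=0.01, far=100.0)    # right of the axis
clipped = project((0.0, 0.0, 200.0), K, near=0.01, far=100.0) # beyond the far plane
```

A point on the optical axis lands on the principal point, and a point beyond the far plane produces no pixel at all, which mirrors how the clipping planes bound the capturable depth range.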

3. Stereo Depth Simulation: Real depth cameras (Intel RealSense, Microsoft Kinect) use stereo matching or structured light to estimate depth. The stereo depth camera sensor simulates this by:

Rendering RGB and depth from two virtual cameras separated by a baseline distance.
Running a semi-global stereo matching algorithm on the image pair.
Converting disparity to depth using the baseline and focal length: depth = baseline * focal_length / disparity, where focal length and disparity are both measured in pixels so that depth takes the units of the baseline.

This produces depth maps with realistic noise patterns: missing values in textureless regions, depth-dependent noise, and occlusion artifacts at object boundaries.
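The final conversion step can be sketched in a few lines. This covers only the disparity-to-depth formula, not the stereo matching itself. The function name and the choice of 0.0 as the marker for invalid pixels are assumptions made for illustration.

```python
def disparity_to_depth(disparity, baseline_m, focal_px, invalid=0.0):
    """Convert a disparity map (pixels) to depth (meters).
    Pixels with no match (disparity <= 0) are marked with `invalid`."""
    return [
        [baseline_m * focal_px / d if d > 0 else invalid for d in row]
        for row in disparity
    ]

# Hypothetical 2x2 disparity map; the 0.0 entry stands for a failed match,
# as might occur in a textureless region.
disparity_map = [[20.0, 10.0], [0.0, 40.0]]
depth_map = disparity_to_depth(disparity_map, baseline_m=0.05, focal_px=400.0)
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error at longer range, which is one source of depth-dependent noise.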

4. Shader Configuration: The rendering pipeline supports multiple output modalities controlled by shader configurations:

RGB: Standard color rendering with physically-based materials and lighting.
Depth: Per-pixel depth values from the camera's Z-buffer.
Segmentation: Per-pixel object or part identifiers for instance and semantic segmentation.
Normals: Per-pixel surface normal vectors.
Albedo: Per-pixel base color without lighting effects.

Each modality can be enabled or disabled independently, and custom shader configurations can be defined for specialized rendering needs.
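One way to picture independently toggled modalities is a small configuration object listing which passes are enabled. The sketch below is a hypothetical illustration (RenderPassesSketch and select_outputs are invented names) and does not reflect the actual fields of the library's shader configuration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RenderPassesSketch:
    """Hypothetical per-modality switches; only RGB is on by default."""
    rgb: bool = True
    depth: bool = False
    segmentation: bool = False
    normals: bool = False
    albedo: bool = False

    def enabled(self):
        """Names of the enabled passes, in declaration order."""
        return [name for name, on in asdict(self).items() if on]

def select_outputs(rendered, passes):
    """Keep only the outputs whose pass is enabled in the configuration."""
    return {name: rendered[name] for name in passes.enabled() if name in rendered}

passes = RenderPassesSketch(depth=True, normals=True)
# Placeholder strings stand in for image buffers.
rendered = {"rgb": "rgb_buffer", "depth": "depth_buffer", "albedo": "albedo_buffer"}
outputs = select_outputs(rendered, passes)
```

Disabling a pass in such a configuration is how a user would skip its rendering cost when the training approach does not need that modality.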

5. Multi-Camera Batched Rendering: In GPU-parallelized simulation, all cameras across all environments are rendered in a single batched call. The sensor system manages the mapping between logical sensors (identified by name) and physical render targets, ensuring efficient GPU utilization.
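The mapping between named sensors and slots in a batched render result could be sketched as follows. This assumes a flat, environment-major layout in which every environment has the same cameras. The layout and the function names are hypothetical and are not taken from the library.

```python
def build_slot_table(camera_names):
    """Assign each logical camera name a fixed slot within one environment."""
    return {name: slot for slot, name in enumerate(camera_names)}

def flat_index(env_index, camera_slot, num_cameras):
    """Position of one (environment, camera) image in the flat batch."""
    return env_index * num_cameras + camera_slot

def unpack_batch(batch, camera_names, num_envs):
    """Regroup a flat batch into {camera name: [one image per environment]}."""
    slots = build_slot_table(camera_names)
    num_cameras = len(camera_names)
    return {
        name: [batch[flat_index(env, slot, num_cameras)] for env in range(num_envs)]
        for name, slot in slots.items()
    }

camera_names = ["base_camera", "wrist_camera"]
# Placeholder strings stand in for the images returned by one batched render call.
batch = ["e0_base", "e0_wrist", "e1_base", "e1_wrist", "e2_base", "e2_wrist"]
observations = unpack_batch(batch, camera_names, num_envs=3)
```

With a table like this, a single render call can fill the whole batch and the per-name observation dictionary is produced by indexing alone.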

Related Pages

Implementation:Haosulab_ManiSkill_BaseSensor -- Abstract base class and configuration for all sensors.
Implementation:Haosulab_ManiSkill_StereoDepthCamera -- Stereo depth camera with realistic depth noise simulation.
Implementation:Haosulab_ManiSkill_ShaderConfig -- Shader configuration system for controlling render output modalities.
