Implementation:Haosulab ManiSkill Get Obs Extra CameraConfig
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | ManiSkill _get_obs_extra and CameraConfig |
| Domain | Simulation, Robotics, Environment_Design, Computer_Vision |
| Related Principle | Principle:Haosulab_ManiSkill_Observation_Definition |
| Source Files | mani_skill/envs/sapien_env.py (L558-560), mani_skill/sensors/camera.py (L32-62) |
| Date | 2026-02-15 |
| Repository | Haosulab/ManiSkill |
Overview
Description
This document describes two APIs for defining observations in a custom ManiSkill task:
- _get_obs_extra(): A method on BaseEnv that task developers override to inject task-specific observation data (goal positions, relative poses, grasp indicators) into the observation dictionary.
- CameraConfig: A dataclass used to configure camera sensors for visual observation modes. Camera configurations are returned by the _default_sensor_configs and _default_human_render_camera_configs properties.
Together, these two mechanisms define what the agent observes: _get_obs_extra() provides the semantic layer (what task-relevant facts are exposed), while CameraConfig provides the perceptual layer (what visual data is captured).
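A minimal skeleton of how the two overrides sit together in a custom task (the class name, env id, and camera placement below are illustrative, and other required overrides such as _load_scene and evaluate are omitted):
import numpy as np
from mani_skill.envs.sapien_env import BaseEnv
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
from mani_skill.utils.registration import register_env

@register_env("MyPushTask-v0", max_episode_steps=100)
class MyPushTask(BaseEnv):
    @property
    def _default_sensor_configs(self):
        # Perceptual layer: a single RGB camera over the workspace
        pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])
        return [CameraConfig("base_camera", pose=pose, width=128, height=128, fov=np.pi / 2)]

    def _get_obs_extra(self, info: dict) -> dict:
        # Semantic layer: task-relevant facts, batched over num_envs
        return dict(tcp_pose=self.agent.tcp.pose.raw_pose)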
Usage
from mani_skill.sensors.camera import CameraConfig
Override _get_obs_extra() in your BaseEnv subclass to add task-specific observation keys, and override the _default_sensor_configs property to configure cameras.
Code Reference
_get_obs_extra Interface (sapien_env.py L558-560)
def _get_obs_extra(self, info: dict) -> dict:
"""Get task-relevant extra observations. Usually defined on a task by task basis.
Args:
info (dict): The info dictionary from self.evaluate(). Contains
success/fail flags and any other computed data.
Returns:
dict: Mapping of observation names to torch.Tensor values.
Each tensor should have shape (num_envs, ...).
Returns empty dict by default.
"""
return dict()
CameraConfig Dataclass (camera.py L32-62)
@dataclass
class CameraConfig(BaseSensorConfig):
uid: str
"""Unique id of the camera."""
pose: Pose
"""Pose of the camera (sapien.Pose or Pose object)."""
width: int
"""Width of the rendered image in pixels."""
height: int
"""Height of the rendered image in pixels."""
fov: float = None
"""Field of view in radians. Either fov or intrinsic must be given."""
near: float = 0.01
"""Near clipping plane distance."""
far: float = 100
"""Far clipping plane distance."""
intrinsic: Array = None
"""Camera intrinsics matrix (3x3). Either fov or intrinsic must be given."""
entity_uid: Optional[str] = None
"""UID of the entity to mount the camera on. Used by agent classes for
defining mounted cameras (e.g., wrist cameras)."""
mount: Union[Actor, Link] = None
"""The Actor or Link to mount the camera on. The camera's global pose
becomes mount.pose * local_pose."""
shader_pack: Optional[str] = "minimal"
"""Shader for rendering. Options: 'minimal' (fastest), 'default', 'rt' (ray-tracing)."""
shader_config: Optional[ShaderConfig] = None
"""Explicit shader config. Overrides shader_pack if given."""
Sensor Config Properties (sapien_env.py)
@property
def _default_sensor_configs(self) -> Union[
BaseSensorConfig, Sequence[BaseSensorConfig], dict[str, BaseSensorConfig]
]:
"""Return sensor configurations for agent observation cameras.
Override to add task-specific cameras. Returns list, dict, or single config."""
return []
@property
def _default_human_render_camera_configs(self) -> Union[
CameraConfig, Sequence[CameraConfig], dict[str, CameraConfig]
]:
"""Return camera configurations for human rendering (render_mode='rgb_array').
Typically higher resolution than sensor cameras."""
return []
I/O Contract
_get_obs_extra
| Parameter | Type | Description |
|---|---|---|
| info | dict | Info dictionary from self.evaluate(). Contains keys like "success", "fail", and any task-specific computed data. |
Returns: dict mapping string keys to torch.Tensor values. Each tensor must have batch dimension self.num_envs as the first axis.
Note: Use self.obs_mode_struct.use_state to conditionally include ground-truth information only in state-based observation modes. This prevents leaking privileged state into visual observation modes (see the first usage example below).
CameraConfig
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| uid | str | Yes | -- | Unique camera identifier |
| pose | Pose or sapien.Pose | Yes | -- | Camera pose in world frame (or local frame if mounted) |
| width | int | Yes | -- | Image width in pixels |
| height | int | Yes | -- | Image height in pixels |
| fov | float | Conditional | None | Field of view in radians. Required if intrinsic is not set. |
| near | float | No | 0.01 | Near clipping plane |
| far | float | No | 100 | Far clipping plane |
| intrinsic | Array | Conditional | None | 3x3 intrinsics matrix. Required if fov is not set. |
| entity_uid | str | No | None | Entity UID for mounting (agent camera use) |
| mount | Actor or Link | No | None | Object to mount camera on |
| shader_pack | str | No | "minimal" | Rendering shader: "minimal", "default", or "rt" |
Constraint: Exactly one of fov or intrinsic must be provided (not both, not neither).
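For the intrinsic branch of this constraint, a minimal sketch (the matrix values are illustrative for a 128x128 image, not from the source):
import numpy as np
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils

# fx = fy = 64 px with cx = cy = 64 gives roughly a 90-degree horizontal FOV
intrinsic = np.array([
    [64.0, 0.0, 64.0],
    [0.0, 64.0, 64.0],
    [0.0, 0.0, 1.0],
])
calibrated = CameraConfig(
    "calibrated_camera",
    pose=sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1]),
    width=128,
    height=128,
    intrinsic=intrinsic,  # fov stays None, satisfying the exactly-one constraint
)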
Usage Examples
Task-Specific Observations with State/Visual Branching
def _get_obs_extra(self, info: dict):
# Always include TCP pose (available in all modes)
obs = dict(
tcp_pose=self.agent.tcp.pose.raw_pose,
)
if self.obs_mode_struct.use_state:
# Only include ground-truth object/goal info in state modes
obs.update(
goal_pos=self.goal_region.pose.p,
obj_pose=self.obj.pose.raw_pose,
)
return obs
Configuring a Sensor Camera
import numpy as np
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
@property
def _default_sensor_configs(self):
# Create a camera looking at the workspace
pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])
return [
CameraConfig(
"base_camera",
pose=pose,
width=128,
height=128,
fov=np.pi / 2,
near=0.01,
far=100,
)
]
Configuring a Human Render Camera
@property
def _default_human_render_camera_configs(self):
# Higher resolution camera for video recording
pose = sapien_utils.look_at([0.6, 0.7, 0.6], [0.0, 0.0, 0.35])
return CameraConfig(
"render_camera",
pose=pose,
width=512,
height=512,
fov=1,
near=0.01,
far=100,
)
Using Info Dict to Avoid Recomputation
import torch

def evaluate(self):
# Compute grasp state (expensive)
is_grasped = self.agent.is_grasping(self.obj)
obj_to_goal = torch.linalg.norm(
self.obj.pose.p - self.goal_pose.p, axis=1
)
success = (obj_to_goal < 0.05) & is_grasped
return {
"success": success,
"is_grasped": is_grasped,
"obj_to_goal_dist": obj_to_goal,
}
def _get_obs_extra(self, info: dict):
obs = dict(tcp_pose=self.agent.tcp.pose.raw_pose)
if self.obs_mode_struct.use_state:
obs["obj_pose"] = self.obj.pose.raw_pose
obs["goal_pose"] = self.goal_pose.raw_pose
# Reuse computed data from evaluate() via info
obs["is_grasped"] = info["is_grasped"].float().unsqueeze(-1)
return obs
Multiple Cameras (Sensor + Wrist)
import numpy as np
import sapien
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils

@property
def _default_sensor_configs(self):
overhead = CameraConfig(
"overhead_cam",
pose=sapien_utils.look_at([0, 0, 1.0], [0, 0, 0]),
width=128,
height=128,
fov=np.pi / 3,
)
# Wrist camera (mounted on robot hand link)
wrist = CameraConfig(
"wrist_cam",
pose=sapien.Pose(p=[0, 0, 0.05]),
width=84,
height=84,
fov=np.pi / 2,
entity_uid="panda_hand", # mounts on this link of the robot
)
return [overhead, wrist]
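Inspecting the Resulting Observations
To sanity-check what these configurations produce at runtime, inspect the observation dictionary. A sketch using ManiSkill's bundled PickCube-v1 task (whose base_camera renders at 128x128); substitute your own registered task id:
import gymnasium as gym
import mani_skill.envs  # registers the bundled ManiSkill tasks

env = gym.make("PickCube-v1", obs_mode="rgbd", num_envs=1)
obs, _ = env.reset(seed=0)
# Each sensor camera appears under obs["sensor_data"] keyed by its uid
print(obs["sensor_data"].keys())
print(obs["sensor_data"]["base_camera"]["rgb"].shape)  # (1, 128, 128, 3)
# Keys returned by _get_obs_extra land under obs["extra"]
print(obs["extra"].keys())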
Related Pages
- Principle:Haosulab_ManiSkill_Observation_Definition -- The principle this implements
- Implementation:Haosulab_ManiSkill_Initialize_Episode_Pattern -- Initialization determines observable state
- Implementation:Haosulab_ManiSkill_Evaluate_Dense_Reward -- Evaluate produces the info dict used by _get_obs_extra
- Heuristic:Haosulab_ManiSkill_Rendering_Memory_Optimization