
Implementation:Haosulab ManiSkill Get Obs Extra CameraConfig

From Leeroopedia
Field Value
Page Type Implementation (Pattern Doc)
Title ManiSkill _get_obs_extra and CameraConfig
Domain Simulation, Robotics, Environment_Design, Computer_Vision
Related Principle Principle:Haosulab_ManiSkill_Observation_Definition
Source Files mani_skill/envs/sapien_env.py (L558-560), mani_skill/sensors/camera.py (L32-62)
Date 2026-02-15
Repository Haosulab/ManiSkill

Overview

Description

This document describes two APIs for defining observations in a custom ManiSkill task:

  • _get_obs_extra(): A method on BaseEnv that task developers override to inject task-specific observation data (goal positions, relative poses, grasp indicators) into the observation dictionary.
  • CameraConfig: A dataclass used to configure camera sensors for visual observation modes. Camera configurations are returned by the _default_sensor_configs and _default_human_render_camera_configs properties.

Together, these two mechanisms define what the agent observes: _get_obs_extra() provides the semantic layer (what task-relevant facts are exposed), while CameraConfig provides the perceptual layer (what visual data is captured).

Usage

from mani_skill.sensors.camera import CameraConfig

Override _get_obs_extra() in your BaseEnv subclass to add task-specific observations, and override the _default_sensor_configs property to configure cameras.

Code Reference

_get_obs_extra Interface (sapien_env.py L558-560)

def _get_obs_extra(self, info: dict) -> dict:
    """Get task-relevant extra observations. Usually defined on a task by task basis.

    Args:
        info (dict): The info dictionary from self.evaluate(). Contains
            success/fail flags and any other computed data.

    Returns:
        dict: Mapping of observation names to torch.Tensor values.
            Each tensor should have shape (num_envs, ...).
            Returns empty dict by default.
    """
    return dict()

CameraConfig Dataclass (camera.py L32-62)

@dataclass
class CameraConfig(BaseSensorConfig):

    uid: str
    """Unique id of the camera."""

    pose: Pose
    """Pose of the camera (sapien.Pose or Pose object)."""

    width: int
    """Width of the rendered image in pixels."""

    height: int
    """Height of the rendered image in pixels."""

    fov: float = None
    """Field of view in radians. Either fov or intrinsic must be given."""

    near: float = 0.01
    """Near clipping plane distance."""

    far: float = 100
    """Far clipping plane distance."""

    intrinsic: Array = None
    """Camera intrinsics matrix (3x3). Either fov or intrinsic must be given."""

    entity_uid: Optional[str] = None
    """UID of the entity to mount the camera on. Used by agent classes for
    defining mounted cameras (e.g., wrist cameras)."""

    mount: Union[Actor, Link] = None
    """The Actor or Link to mount the camera on. The camera's global pose
    becomes mount.pose * local_pose."""

    shader_pack: Optional[str] = "minimal"
    """Shader for rendering. Options: 'minimal' (fastest), 'default', 'rt' (ray-tracing)."""

    shader_config: Optional[ShaderConfig] = None
    """Explicit shader config. Overrides shader_pack if given."""

Sensor Config Properties (sapien_env.py)

@property
def _default_sensor_configs(self) -> Union[
    BaseSensorConfig, Sequence[BaseSensorConfig], dict[str, BaseSensorConfig]
]:
    """Return sensor configurations for agent observation cameras.
    Override to add task-specific cameras. Returns list, dict, or single config."""
    return []

@property
def _default_human_render_camera_configs(self) -> Union[
    CameraConfig, Sequence[CameraConfig], dict[str, CameraConfig]
]:
    """Return camera configurations for human rendering (render_mode='rgb_array').
    Typically higher resolution than sensor cameras."""
    return []

I/O Contract

_get_obs_extra

Parameter Type Description
info dict Info dictionary from self.evaluate(). Contains keys like "success", "fail", and any task-specific computed data.

Returns: dict mapping string keys to torch.Tensor values. Each tensor must have batch dimension self.num_envs as the first axis.

Note: Use self.obs_mode_struct.use_state to conditionally include ground-truth information only in state-based observation modes. This prevents leaking privileged state info in visual observation modes.
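A minimal sketch of the batching contract, with numpy arrays standing in for torch tensors and an invented num_envs:

```python
import numpy as np

num_envs = 4  # hypothetical number of parallel environments

# Every value returned from _get_obs_extra must have the batch
# dimension (num_envs) as its first axis.
obs = dict(
    tcp_pose=np.zeros((num_envs, 7)),    # position (3) + quaternion (4)
    is_grasped=np.zeros((num_envs, 1)),  # per-env flags as a column vector
)

# Scalar-per-env quantities are typically given an explicit trailing
# axis (num_envs, 1) rather than left as flat (num_envs,) vectors.
assert all(v.shape[0] == num_envs for v in obs.values())
```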

CameraConfig

Field Type Required Default Description
uid str Yes -- Unique camera identifier
pose Pose or sapien.Pose Yes -- Camera pose in world frame (or local frame if mounted)
width int Yes -- Image width in pixels
height int Yes -- Image height in pixels
fov float Conditional None Field of view in radians. Required if intrinsic is not set.
near float No 0.01 Near clipping plane
far float No 100 Far clipping plane
intrinsic Array Conditional None 3x3 intrinsics matrix. Required if fov is not set.
entity_uid str No None Entity UID for mounting (agent camera use)
mount Actor or Link No None Object to mount camera on
shader_pack str No "minimal" Rendering shader: "minimal", "default", or "rt"

Constraint: Exactly one of fov or intrinsic must be provided (not both, not neither).
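For reference, the standard pinhole model connects a field of view to an equivalent 3x3 intrinsics matrix, which clarifies why exactly one of the two is needed. The helper below is a hypothetical name, not a ManiSkill API, and it assumes a vertical fov, square pixels, and a centered principal point; check the renderer documentation for the exact convention.

```python
import numpy as np

def fov_to_intrinsic(fov: float, width: int, height: int) -> np.ndarray:
    """Standard pinhole relation: focal length in pixels from a vertical fov.

    fy = height / (2 * tan(fov / 2)); fx = fy under the square-pixel
    assumption; the principal point is taken as the image center.
    """
    fy = height / (2.0 * np.tan(fov / 2.0))
    fx = fy
    return np.array([
        [fx,  0.0, width / 2.0],
        [0.0, fy,  height / 2.0],
        [0.0, 0.0, 1.0],
    ])

# With fov = pi/2 and a 128x128 image, tan(fov/2) = 1, so fx = fy = 64.
K = fov_to_intrinsic(np.pi / 2, 128, 128)
```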

Usage Examples

Task-Specific Observations with State/Visual Branching

def _get_obs_extra(self, info: dict):
    # Always include TCP pose (available in all modes)
    obs = dict(
        tcp_pose=self.agent.tcp.pose.raw_pose,
    )
    if self.obs_mode_struct.use_state:
        # Only include ground-truth object/goal info in state modes
        obs.update(
            goal_pos=self.goal_region.pose.p,
            obj_pose=self.obj.pose.raw_pose,
        )
    return obs

Configuring a Sensor Camera

import numpy as np

from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils

@property
def _default_sensor_configs(self):
    # Create a camera looking at the workspace
    pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])
    return [
        CameraConfig(
            "base_camera",
            pose=pose,
            width=128,
            height=128,
            fov=np.pi / 2,
            near=0.01,
            far=100,
        )
    ]

Configuring a Human Render Camera

@property
def _default_human_render_camera_configs(self):
    # Higher resolution camera for video recording
    pose = sapien_utils.look_at([0.6, 0.7, 0.6], [0.0, 0.0, 0.35])
    return CameraConfig(
        "render_camera",
        pose=pose,
        width=512,
        height=512,
        fov=1,
        near=0.01,
        far=100,
    )

Using Info Dict to Avoid Recomputation

def evaluate(self):
    # Compute grasp state (expensive)
    is_grasped = self.agent.is_grasping(self.obj)
    obj_to_goal = torch.linalg.norm(
        self.obj.pose.p - self.goal_pose.p, axis=1
    )
    success = (obj_to_goal < 0.05) & is_grasped
    return {
        "success": success,
        "is_grasped": is_grasped,
        "obj_to_goal_dist": obj_to_goal,
    }

def _get_obs_extra(self, info: dict):
    obs = dict(tcp_pose=self.agent.tcp.pose.raw_pose)
    if self.obs_mode_struct.use_state:
        obs["obj_pose"] = self.obj.pose.raw_pose
        obs["goal_pose"] = self.goal_pose.raw_pose
        # Reuse computed data from evaluate() via info
        obs["is_grasped"] = info["is_grasped"].float().unsqueeze(-1)
    return obs

Multiple Cameras (Sensor + Wrist)

import numpy as np
import sapien

@property
def _default_sensor_configs(self):
    overhead = CameraConfig(
        "overhead_cam",
        pose=sapien_utils.look_at([0, 0, 1.0], [0, 0, 0]),
        width=128,
        height=128,
        fov=np.pi / 3,
    )
    # Wrist camera (mounted on robot hand link)
    wrist = CameraConfig(
        "wrist_cam",
        pose=sapien.Pose(p=[0, 0, 0.05]),
        width=84,
        height=84,
        fov=np.pi / 2,
        entity_uid="panda_hand",  # mounts on this link of the robot
    )
    return [overhead, wrist]
