Principle:ARISE Initiative Robosuite Camera Projection Utilities
| Knowledge Sources | |
|---|---|
| Domains | Robotics, Computer Vision, 3D Geometry |
| Last Updated | 2026-02-15 07:00 GMT |
Overview
A set of utility functions for computing camera intrinsic and extrinsic matrices from simulation state, projecting 3D world points to 2D pixel coordinates, deprojecting pixels with known depth back to 3D points, and generating point clouds from depth images.
Description
Simulated cameras in a physics engine capture images of the scene, but many robotics applications also need to convert between the 3D world coordinate frame and the 2D image pixel frame. Camera projection utilities bridge these representations: they compute the standard intrinsic and extrinsic camera matrices from the simulator's camera parameters and use them for projection, deprojection, and point cloud generation.
The intrinsic matrix encodes the camera's optical properties: focal length and principal point. In simulation, the focal length (in pixels) is derived from the camera's vertical field of view and the image height, and the principal point is taken to be the image center. The extrinsic matrix encodes the camera's pose in the world frame, which is obtained from the simulator's camera position and rotation data. Because MuJoCo's camera frame looks along its negative z-axis with the y-axis pointing up, a correction flips the y and z axes to match the standard OpenCV convention, in which the z-axis points along the viewing direction and the y-axis points down in the image.
With these matrices, world-to-pixel projection multiplies a homogeneous 3D point by the combined intrinsic-extrinsic transform and divides the first two components by the third (the depth along the optical axis). Pixel-to-world deprojection inverts this process, using the depth at a pixel to recover the 3D point. Point cloud generation applies deprojection to every pixel in a depth image, producing a dense 3D point cloud in either camera or world coordinates. These operations underpin visual servoing, grasp planning from depth images, and sim-to-real transfer of vision-based policies.
Usage
Use these utilities when working with simulated camera observations that need to be converted between pixel space and world space. Common applications include computing 3D object positions from pixel coordinates with known depth, generating point clouds for grasp planning, and setting up camera parameters for sim-to-real transfer.
Theoretical Basis
Intrinsic matrix from field of view:
f = (H / 2) / tan(fovy * pi / 360)
K = | f  0  W/2 |
    | 0  f  H/2 |
    | 0  0  1   |
where H = image height, W = image width, fovy = vertical FOV in degrees
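The formula above can be sketched in NumPy as follows. This is a minimal illustration, not the library's own code: the function name `intrinsic_from_fovy` and the 640x480, 90-degree example camera are assumptions chosen for readability.

```python
import numpy as np


def intrinsic_from_fovy(fovy_deg, height, width):
    """Build a 3x3 pinhole intrinsic matrix from a vertical FOV in degrees.

    Hypothetical helper: it follows f = (H / 2) / tan(fovy * pi / 360)
    and places the principal point at the image center.
    """
    f = 0.5 * height / np.tan(fovy_deg * np.pi / 360.0)
    return np.array([
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ])


# Example values (assumed): a 640x480 image with a 90-degree vertical FOV.
# tan(45 degrees) is 1, so the focal length is approximately H / 2 = 240 pixels.
K = intrinsic_from_fovy(fovy_deg=90.0, height=480, width=640)
```

Note that the same focal length is used for both axes, which presumes square pixels.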
Extrinsic matrix with axis correction:
R_world = make_pose(cam_pos, cam_rot) -- 4x4 homogeneous pose
camera_axis_correction = | 1  0  0  0 |
                         | 0 -1  0  0 |
                         | 0  0 -1  0 |
                         | 0  0  0  1 |
R = R_world @ camera_axis_correction
The axis correction flips y and z axes to convert from MuJoCo's camera convention to the standard computer vision convention.
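A minimal NumPy sketch of the extrinsic construction is shown here. The `make_pose` below is a stand-in written for this example (assumed to assemble a 4x4 pose from a position and a 3x3 rotation matrix), and the camera pose is made up.

```python
import numpy as np


def make_pose(pos, rot):
    """Stand-in helper: assemble a 4x4 homogeneous pose from a 3-vector
    position and a 3x3 rotation matrix."""
    pose = np.eye(4)
    pose[:3, :3] = rot
    pose[:3, 3] = pos
    return pose


# Flips the y and z axes: simulator camera frame -> computer vision frame.
CAMERA_AXIS_CORRECTION = np.diag([1.0, -1.0, -1.0, 1.0])


def extrinsic_from_pose(cam_pos, cam_rot):
    """Camera-to-world pose expressed in the computer vision convention."""
    return make_pose(cam_pos, cam_rot) @ CAMERA_AXIS_CORRECTION


# Assumed example: camera 1 m above the origin with identity rotation.
# In the simulator convention this camera looks along world -z (downward).
cam_pos = np.array([0.0, 0.0, 1.0])
cam_rot = np.eye(3)
R = extrinsic_from_pose(cam_pos, cam_rot)

# After correction, the third column of R is the viewing direction in world
# coordinates, which for this example is world -z.
view_dir = R[:3, 2]
```

Because the correction negates two axes, the rotation part of `R` remains a proper rotation (determinant +1).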
World-to-pixel projection:
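K_4x4 = K padded to 4x4 (identity in the added row and column, so shapes match)
R_inv = inverse of R -- world-to-camera transform
p_cam = K_4x4 @ R_inv @ [x_w, y_w, z_w, 1]^T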
u = p_cam[0] / p_cam[2]
v = p_cam[1] / p_cam[2]
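The projection can be sketched in NumPy as below. The function name `world_to_pixel` is hypothetical, and the sketch assumes the 3x3 intrinsic matrix is embedded in a 4x4 identity so it can be multiplied with the 4x4 inverse extrinsic. The example camera sits at the world origin looking along +z, with made-up intrinsics.

```python
import numpy as np


def world_to_pixel(point_world, K, R):
    """Project a 3D world point to pixel coordinates (u, v).

    K is the 3x3 intrinsic matrix; R is the 4x4 camera-to-world pose in the
    computer vision convention (z along the viewing direction).
    """
    K_4x4 = np.eye(4)
    K_4x4[:3, :3] = K                      # pad intrinsics to 4x4
    p_h = np.append(point_world, 1.0)      # homogeneous world point
    p_cam = K_4x4 @ np.linalg.inv(R) @ p_h
    u = p_cam[0] / p_cam[2]                # divide by depth
    v = p_cam[1] / p_cam[2]
    return u, v


# Assumed example values: f = 240, principal point at (320, 240).
K = np.array([[240.0, 0.0, 320.0],
              [0.0, 240.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(4)  # camera frame coincides with the world frame

u, v = world_to_pixel(np.array([0.5, -0.25, 2.0]), K, R)
# u = 380, v = 210: right of and above the image center
```

Here `u` indexes image columns and `v` indexes image rows; code that indexes arrays as `image[row, col]` needs the pair swapped. Points with non-positive depth lie behind the camera and should be filtered before the division.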
Pixel-to-world deprojection:
x_cam = (u - W/2) * depth / f
y_cam = (v - H/2) * depth / f
z_cam = depth
p_world = R @ [x_cam, y_cam, z_cam, 1]^T
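As a sketch of the deprojection equations, the snippet below recovers a world point from one pixel and its depth. The function name `pixel_to_world` and all numeric values are assumptions for illustration, and `depth` is assumed to be the metric distance along the optical axis rather than a normalized depth-buffer value.

```python
import numpy as np


def pixel_to_world(u, v, depth, f, width, height, R):
    """Deproject pixel (u, v) with metric depth to a 3D world point.

    f is the focal length in pixels; R is the 4x4 camera-to-world pose in the
    computer vision convention.
    """
    x_cam = (u - width / 2.0) * depth / f
    y_cam = (v - height / 2.0) * depth / f
    p_cam = np.array([x_cam, y_cam, depth, 1.0])  # homogeneous camera point
    return (R @ p_cam)[:3]


# Assumed example: camera translated to (1, 2, 3) with no rotation.
R = np.eye(4)
R[:3, 3] = [1.0, 2.0, 3.0]

p_world = pixel_to_world(u=380.0, v=210.0, depth=2.0,
                         f=240.0, width=640, height=480, R=R)
# Camera-frame point is (0.5, -0.25, 2.0); world point is (1.5, 1.75, 5.0)
```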
Point cloud generation applies deprojection to all pixels in a depth image, optionally transforming results from camera frame to world frame.
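A vectorized version of this step might look like the following sketch. The function name `depth_to_point_cloud` is hypothetical, the depth image is assumed to hold metric depth along the optical axis, and the tiny 4x6 depth map is only for demonstration.

```python
import numpy as np


def depth_to_point_cloud(depth, f, R=None):
    """Deproject every pixel of an (H, W) metric depth image.

    Returns an (H * W, 3) array of points in the camera frame, or in the
    world frame when a 4x4 camera-to-world pose R is supplied.
    """
    height, width = depth.shape
    v, u = np.mgrid[0:height, 0:width]          # v: row index, u: column index
    x = (u - width / 2.0) * depth / f
    y = (v - height / 2.0) * depth / f
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    if R is not None:
        # Apply rotation and translation to each row vector.
        points = points @ R[:3, :3].T + R[:3, 3]
    return points


# Assumed example: a flat surface 2 m in front of the camera.
depth = np.full((4, 6), 2.0)
cloud_cam = depth_to_point_cloud(depth, f=3.0)

R = np.eye(4)
R[:3, 3] = [0.0, 0.0, 1.0]                      # camera shifted 1 m along world z
cloud_world = depth_to_point_cloud(depth, f=3.0, R=R)
```

Points are ordered row by row, so `cloud_cam.reshape(H, W, 3)` restores the image layout. Pixels with invalid depth (zero or beyond the far plane) are not filtered in this sketch.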