Principle:Isaac sim IsaacGymEnvs Assembly Sub Policy Training
| Knowledge Sources | |
|---|---|
| Domains | Manipulation, Reinforcement_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Approach for training sequential manipulation sub-policies using keypoint-based and SDF-based reward shaping for contact-rich assembly tasks.
Description
Assembly tasks are decomposed into sub-policies that are trained independently and composed sequentially. For nut-bolt assembly, the sub-policies are pick, place, and screw. For IndustReal plug-socket tasks, the primary sub-policy is insert. Each sub-policy uses shaped rewards to guide learning through contact-rich manipulation:
- Keypoint distance rewards: For approach and grasp phases, reward is based on the distance between keypoints on the robot fingertip and keypoints on the target object.
- SDF-based rewards: For fine contact phases (insertion, threading), reward is computed from SDF queries that measure how deeply one part has entered another.
- SAPU (Simulation-Aware Policy Update): IndustReal scales reward by an interpenetration threshold check, penalizing physically unrealistic states that arise from simulation artifacts.
- SBC (Sampling-Based Curriculum): IndustReal progressively increases task difficulty by expanding the maximum initial displacement of objects from their goal positions.
RL actions are 6D vectors (3D position delta + 3D rotation delta) that are scaled by pos_action_scale and rot_action_scale before being passed through the controller layer (see Robot_Controller_Configuration).
Usage
Use this principle when training manipulation policies for precision assembly tasks. The training loop follows the standard IsaacGymEnvs pattern:
- Environment resets randomize object poses within curriculum bounds.
- Policy observes fingertip pose, object poses, and goal information.
- Policy outputs 6D action deltas.
- Actions are scaled and converted to joint commands via the controller.
- Reward is computed from keypoint distances, SDF queries, and bonus terms.
- Episodes terminate on success (engagement detected) or timeout.
Theoretical Basis
Keypoint Rewards
Keypoint rewards provide dense learning signal for approach and grasping phases:
R_keypoint = -sum_{i=1}^{K} ||k_robot_i - k_object_i||
where:
K = number of keypoint pairs
k_robot_i = i-th keypoint on robot fingertip
k_object_i = i-th keypoint on target object
SDF Rewards
SDF rewards provide smooth gradients for insertion and threading:
R_sdf = sum_{i=1}^{N} SDF_socket(T_{socket}^{-1} * T_{plug} * p_plug_i)
where:
N = number of sampled surface points on the plug
T_{plug}, T_{socket} = rigid body transforms
SDF_socket(p) < 0 indicates p is inside the socket mesh
SAPU (Simulation-Aware Policy Update)
SAPU addresses simulation artifacts by scaling rewards based on physical plausibility:
R_sapu_scale = 1.0 if max_interpenetration < interpen_thresh
= 0.0 otherwise
R_total = R_sapu_scale * (R_keypoint + R_sdf + R_bonus)
Interpenetration is detected by querying plug surface points against the socket SDF and checking if any penetration exceeds the threshold.
SBC (Sampling-Based Curriculum)
SBC progressively increases task difficulty:
R_curriculum_scale = f(curr_max_disp)
where curr_max_disp increases over training:
- Start: small displacement (easy initial conditions)
- End: large displacement (full task difficulty)