Principle:Isaac sim IsaacGymEnvs Assembly Sub Policy Training

From Leeroopedia
Knowledge Sources
Domains Manipulation, Reinforcement_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

An approach for training sequential manipulation sub-policies for contact-rich assembly tasks, using reward shaping based on keypoint distances and signed distance fields (SDFs).

Description

Assembly tasks are decomposed into sub-policies that are trained independently and composed sequentially. For nut-bolt assembly, the sub-policies are pick, place, and screw. For IndustReal plug-socket tasks, the primary sub-policy is insert. Each sub-policy uses shaped rewards to guide learning through contact-rich manipulation:

  • Keypoint distance rewards: For approach and grasp phases, reward is based on the distance between keypoints on the robot fingertip and keypoints on the target object.
  • SDF-based rewards: For fine contact phases (insertion, threading), reward is computed from SDF queries that measure how deeply one part has entered another.
  • SAPU (Simulation-Aware Policy Update): IndustReal gates the reward with an interpenetration threshold check. When the parts interpenetrate beyond the threshold, which is a simulation artifact rather than a physically realistic state, the reward is scaled to zero.
  • SBC (Sampling-Based Curriculum): IndustReal progressively increases task difficulty by expanding the maximum initial displacement of objects from their goal positions.

RL actions are 6D vectors (3D position delta + 3D rotation delta) that are scaled by pos_action_scale and rot_action_scale before being passed through the controller layer (see Robot_Controller_Configuration).
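The scaling step can be sketched as follows. This is a minimal illustration, assuming that pos_action_scale and rot_action_scale are per-axis lists of three values; the function name scale_action and the numeric values are hypothetical, and the controller layer itself is not shown.

```python
from typing import List, Sequence, Tuple


def scale_action(
    action: Sequence[float],
    pos_action_scale: Sequence[float],
    rot_action_scale: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Split a 6D action into scaled position and rotation deltas."""
    if len(action) != 6:
        raise ValueError("expected a 6D action (3D position + 3D rotation)")
    # First three components: position delta; last three: rotation delta.
    pos_delta = [a * s for a, s in zip(action[:3], pos_action_scale)]
    rot_delta = [a * s for a, s in zip(action[3:], rot_action_scale)]
    return pos_delta, rot_delta


# Illustrative values only (assumed, not taken from a specific config).
pos_delta, rot_delta = scale_action(
    [1.0, -0.5, 0.0, 0.0, 0.0, 1.0],
    pos_action_scale=[0.1, 0.1, 0.1],
    rot_action_scale=[0.2, 0.2, 0.2],
)
```

The two returned deltas are what would be handed to the controller as the target change in fingertip pose.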

Usage

Use this principle when training manipulation policies for precision assembly tasks. The training loop follows the standard IsaacGymEnvs pattern:

  1. Environment resets randomize object poses within curriculum bounds.
  2. Policy observes fingertip pose, object poses, and goal information.
  3. Policy outputs 6D action deltas.
  4. Actions are scaled and converted to joint commands via the controller.
  5. Reward is computed from keypoint distances, SDF queries, and bonus terms.
  6. Episodes terminate on success (engagement detected) or timeout.
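The six steps above can be illustrated with a toy loop. This is not IsaacGymEnvs code: ToyInsertEnv is a hypothetical 1D stand-in in which the "pose" is a scalar offset from the goal, and proportional_policy is a fixed controller used in place of a learned policy. It only shows how the steps fit together.

```python
import random


class ToyInsertEnv:
    """Hypothetical 1D stand-in for an assembly environment."""

    def __init__(self, max_disp=0.05, action_scale=0.01,
                 success_tol=0.002, max_steps=50, seed=0):
        self.max_disp = max_disp          # curriculum bound on initial offset
        self.action_scale = action_scale  # plays the role of pos_action_scale
        self.success_tol = success_tol
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.offset = 0.0
        self.steps = 0

    def reset(self):
        # Step 1: randomize the initial offset within the curriculum bound.
        self.offset = self.rng.uniform(-self.max_disp, self.max_disp)
        self.steps = 0
        # Step 2: the observation is the offset from the goal.
        return self.offset

    def step(self, action):
        # Step 4: clip and scale the action; a real controller would
        # convert the scaled delta into joint commands.
        clipped = max(-1.0, min(1.0, action))
        self.offset += self.action_scale * clipped
        self.steps += 1
        # Step 5: distance-based reward (stand-in for keypoint/SDF terms).
        reward = -abs(self.offset)
        # Step 6: terminate on success or timeout.
        success = abs(self.offset) < self.success_tol
        done = success or self.steps >= self.max_steps
        return self.offset, reward, done, success


def proportional_policy(obs, gain=100.0):
    # Step 3: output an action delta that pushes the offset toward zero.
    return -gain * obs


env = ToyInsertEnv()
obs = env.reset()
done = False
success = False
while not done:
    obs, reward, done, success = env.step(proportional_policy(obs))
```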

Theoretical Basis

Keypoint Rewards

Keypoint rewards provide a dense learning signal for the approach and grasping phases:

R_keypoint = -sum_{i=1}^{K} ||k_robot_i - k_object_i||

where:
  K = number of keypoint pairs
  k_robot_i = i-th keypoint on robot fingertip
  k_object_i = i-th keypoint on target object
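
The formula above can be sketched in a few lines. This is a minimal illustration, assuming the norm is the Euclidean distance and that keypoints are given as paired 3D points; the function name keypoint_reward is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def keypoint_reward(robot_kps: Sequence[Point], object_kps: Sequence[Point]) -> float:
    """Negative sum of Euclidean distances between paired keypoints."""
    if len(robot_kps) != len(object_kps):
        raise ValueError("keypoints must come in pairs")
    return -sum(math.dist(r, o) for r, o in zip(robot_kps, object_kps))


# Two keypoint pairs: the first pair is 5 units apart, the second coincides.
reward = keypoint_reward(
    robot_kps=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    object_kps=[(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)],
)
```

The reward is always non-positive and reaches its maximum of zero only when every keypoint pair coincides.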

SDF Rewards

SDF rewards provide smooth gradients for insertion and threading:

R_sdf = sum_{i=1}^{N} SDF_socket(T_{socket}^{-1} * T_{plug} * p_plug_i)

where:
  N = number of sampled surface points on the plug
  T_{plug}, T_{socket} = rigid body transforms
  SDF_socket(p) < 0 indicates p is inside the socket mesh
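
A minimal sketch of the sum above follows. It assumes rigid transforms are stored as a (rotation matrix, translation) pair, and it uses an analytic unit-sphere SDF as a hypothetical stand-in for the socket SDF, which in practice would be queried from the socket mesh. All function names are illustrative.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Transform = Tuple[Mat3, Vec3]  # (R, t), mapping p to R p + t


def apply(T: Transform, p: Vec3) -> Vec3:
    """Apply a rigid transform to a point."""
    R, t = T
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def inverse(T: Transform) -> Transform:
    """Invert a rigid transform: (R, t) -> (R^T, -R^T t)."""
    R, t = T
    Rt = tuple(tuple(R[j][i] for j in range(3)) for i in range(3))
    t_inv = tuple(-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3))
    return Rt, t_inv


def sdf_reward(
    sdf_socket: Callable[[Vec3], float],
    T_socket: Transform,
    T_plug: Transform,
    plug_points: Sequence[Vec3],
) -> float:
    """Sum of socket-SDF values at plug points expressed in the socket frame."""
    T_socket_inv = inverse(T_socket)
    return sum(sdf_socket(apply(T_socket_inv, apply(T_plug, p))) for p in plug_points)


def sphere_sdf(p: Vec3, radius: float = 1.0) -> float:
    # Stand-in SDF (assumption): negative inside a sphere at the origin.
    return math.sqrt(sum(c * c for c in p)) - radius


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
T_socket: Transform = (IDENTITY, (1.0, 0.0, 0.0))
T_plug: Transform = (IDENTITY, (1.0, 0.0, 2.0))
plug_points = [(0.0, 0.0, 0.0), (0.0, 0.0, -1.0)]

r_sdf = sdf_reward(sphere_sdf, T_socket, T_plug, plug_points)
```

In this example the two plug points land at distances 2 and 1 from the socket origin, so their SDF values are 1 and 0.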

SAPU (Simulation-Aware Policy Update)

SAPU addresses simulation artifacts by scaling rewards based on physical plausibility:

R_sapu_scale = 1.0  if max_interpenetration < interpen_thresh
             = 0.0  otherwise

R_total = R_sapu_scale * (R_keypoint + R_sdf + R_bonus)

Interpenetration is detected by querying plug surface points against the socket SDF and checking whether any penetration depth exceeds the threshold.
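
The gating can be sketched as below. This is an illustration under stated assumptions: penetration depth is taken to be the negated SDF value for points inside the socket mesh (and zero otherwise), and the function names and numeric values are hypothetical.

```python
from typing import Sequence


def penetration_depths(sdf_values: Sequence[float]) -> list:
    """Depth is -SDF for points inside the mesh (SDF < 0), else 0 (assumption)."""
    return [max(0.0, -d) for d in sdf_values]


def sapu_scale(sdf_values: Sequence[float], interpen_thresh: float) -> float:
    """Return 1.0 if the deepest penetration is below the threshold, else 0.0."""
    max_interpenetration = max(penetration_depths(sdf_values), default=0.0)
    return 1.0 if max_interpenetration < interpen_thresh else 0.0


def total_reward(
    r_keypoint: float,
    r_sdf: float,
    r_bonus: float,
    sdf_values: Sequence[float],
    interpen_thresh: float,
) -> float:
    """R_total = R_sapu_scale * (R_keypoint + R_sdf + R_bonus)."""
    return sapu_scale(sdf_values, interpen_thresh) * (r_keypoint + r_sdf + r_bonus)


# Shallow penetration (0.5 mm) stays under a 1 mm threshold: reward kept.
kept = total_reward(-0.5, 0.2, 1.0, sdf_values=[0.002, -0.0005], interpen_thresh=0.001)
# Deep penetration (3 mm) exceeds the threshold: reward zeroed.
zeroed = total_reward(-0.5, 0.2, 1.0, sdf_values=[-0.003], interpen_thresh=0.001)
```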

SBC (Sampling-Based Curriculum)

SBC progressively increases task difficulty:

R_curriculum_scale = f(curr_max_disp)

where curr_max_disp increases over training:
  - Start: small displacement (easy initial conditions)
  - End: large displacement (full task difficulty)
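
One way to picture the growth of curr_max_disp is the sketch below. The update rule here is an assumption made for illustration (expand the bound by a fixed step whenever the recent success rate passes a threshold); the source does not specify the schedule or the form of f, and all names and values other than curr_max_disp are hypothetical.

```python
import random


def update_max_disp(
    curr_max_disp: float,
    success_rate: float,
    *,
    min_disp: float,
    max_disp: float,
    step: float,
    success_thresh: float,
) -> float:
    """Expand the displacement bound when the policy succeeds often enough."""
    if success_rate >= success_thresh:
        curr_max_disp += step
    # Keep the bound inside [min_disp, max_disp].
    return min(max(curr_max_disp, min_disp), max_disp)


def sample_initial_disp(curr_max_disp: float, rng: random.Random) -> float:
    """Sample an initial displacement within the current curriculum bound."""
    return rng.uniform(0.0, curr_max_disp)


limits = dict(min_disp=0.01, max_disp=0.05, step=0.005, success_thresh=0.8)

harder = update_max_disp(0.01, success_rate=0.9, **limits)    # bound grows
same = update_max_disp(0.01, success_rate=0.5, **limits)      # bound unchanged
capped = update_max_disp(0.05, success_rate=1.0, **limits)    # bound clamped

rng = random.Random(0)
disp = sample_initial_disp(harder, rng)
```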

Related Pages

Implemented By
