Implementation:Isaac sim IsaacGymEnvs PBT Engine

From Leeroopedia
Revision as of 13:08, 16 February 2026 by Admin (auto-imported from implementations/Isaac_sim_IsaacGymEnvs_PBT_Engine.md)
Knowledge Sources
Domains: Hyperparameter_Optimization, Evolutionary_Computing
Last Updated: 2026-02-15 11:00 GMT

Overview

The PBT (Population-Based Training) engine implements evolutionary hyperparameter optimization by running a population of RL policies in parallel, periodically replacing the worst performers with mutated copies of the best.
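
The replace-the-worst cycle can be illustrated without Isaac Gym at all. The toy below is a minimal, single-process sketch under stated assumptions: each population member is a hypothetical Member with one hyperparameter (lr) and a synthetic score standing in for a trained policy, and the multiply-or-divide perturbation is just one common choice of mutation. None of these names come from IsaacGymEnvs.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    lr: float            # the one "hyperparameter" this toy tunes
    score: float = 0.0   # accumulated progress; stands in for a trained policy's quality


def train_interval(m: Member) -> None:
    # Synthetic objective: progress per interval is highest when lr is near 0.01.
    m.score += 1.0 / (1.0 + 100.0 * abs(m.lr - 0.01))


def pbt_toy(pop_size: int = 8, iterations: int = 20, frac: float = 0.25,
            seed: int = 0) -> List[Member]:
    rng = random.Random(seed)
    pop = [Member(lr=rng.uniform(1e-4, 1e-1)) for _ in range(pop_size)]
    k = max(1, int(pop_size * frac))  # number of losers replaced / number of donors
    for _ in range(iterations):
        for m in pop:
            train_interval(m)
        ranked = sorted(pop, key=lambda x: x.score, reverse=True)  # best first
        best, worst = ranked[:k], ranked[-k:]
        for loser in worst:
            donor = rng.choice(best)
            # Exploit: inherit the donor's progress (in real PBT, its weights).
            loser.score = donor.score
            # Explore: perturb the inherited hyperparameter up or down.
            factor = rng.uniform(1.1, 1.5)
            loser.lr = donor.lr * factor if rng.random() < 0.5 else donor.lr / factor
    return pop
```

Calling pbt_toy() returns the final population; sorting it by score shows which learning rates survived. Copying the donor's state along with its hyperparameters is the exploit step that distinguishes PBT from independent random search.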

Description

The PBT system in isaacgymenvs/pbt/pbt.py consists of three main components: PbtParams (a configuration container), PbtAlgoObserver (the core PBT logic, implemented as an rl_games AlgoObserver), and initial_pbt_check() (the entry-point function). Training runs as multiple independent processes, one per policy, each with its own hyperparameters; the processes share no memory and exchange information only through checkpoint files in a shared PBT workspace directory. At regular intervals of interval_steps environment steps, each process checkpoints its current state and compares its objective against the rest of the population.
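
The interval check can be sketched as a pure function. This is an assumption-laden illustration, not the code in pbt.py: pbt_iteration_due is a hypothetical helper that compares environment-frame counters before and after a training update against interval_steps and start_after.

```python
def pbt_iteration_due(prev_frames: int, frames: int,
                      interval_steps: int, start_after: int) -> bool:
    """Return True if the latest update crossed an interval_steps boundary after warm-up.

    Hypothetical helper: `frames` is the env-frame counter after the update,
    `prev_frames` is the counter before it.
    """
    if frames < start_after:
        return False  # still warming up: no checkpointing or replacement yet
    # Frames advance in large batches (num_envs * horizon), so an exact
    # `frames % interval_steps == 0` test would almost never fire.
    return frames // interval_steps > prev_frames // interval_steps
```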

PbtAlgoObserver hooks into the rl_games training loop through the AlgoObserver interface. Its process_infos() method tracks per-environment objective values: a custom "true_objective" reported by the environment if present, otherwise the default mean reward. The after_steps() method implements the core PBT logic: it saves a checkpoint of the current policy, loads checkpoint data from all other policies in the population, ranks the policies by objective value, and decides whether the current policy should be replaced. If the current policy falls in the worst replace_fraction_worst of the population, it is replaced by a mutated copy of a policy from the best replace_fraction_best, and the process restarts itself via os.execv() with the new hyperparameters.
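
The ranking-and-replacement decision made in after_steps() can be sketched as a function of the population's objectives. choose_replacement is a hypothetical helper; it ranks purely by objective and does not model any additional safeguards the real implementation may apply before replacing a policy.

```python
import random
from typing import Dict, Optional


def choose_replacement(objectives: Dict[int, float], policy_idx: int,
                       replace_fraction_worst: float, replace_fraction_best: float,
                       rng: random.Random) -> Optional[int]:
    """Return the index of a policy to copy from, or None to keep training unchanged."""
    n = len(objectives)
    if n < 2:
        return None  # a single policy has nothing to compare against
    ranked = sorted(objectives, key=lambda i: objectives[i], reverse=True)  # best first
    num_worst = max(1, int(n * replace_fraction_worst))
    num_best = max(1, int(n * replace_fraction_best))
    if policy_idx not in ranked[n - num_worst:]:
        return None  # not among the worst: keep current weights and hyperparameters
    donors = [i for i in ranked[:num_best] if i != policy_idx]
    return rng.choice(donors) if donors else None
```

With the example config (8 policies, replace_fraction_worst 0.2, replace_fraction_best 0.3), this sketch rounds down to one replaced policy and two candidate donors.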

The mutation logic lives in a separate mutation.py module. It randomly perturbs the hyperparameters listed under pbt.mutation: mutation_rate controls how likely each parameter is to be perturbed, and change_min and change_max bound the size of the perturbation. The helper function _restart_process_with_new_params() rebuilds the command-line arguments with the updated parameters and re-executes the Python process. Because all PBT behavior sits in an observer plus a process restart, PBT works with any IsaacGymEnvs task without modifying the underlying RL algorithm.
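
A minimal sketch of the mutation step, assuming a multiplicative scheme: each listed parameter is perturbed with probability mutation_rate, by multiplying or dividing it by a factor drawn uniformly from [change_min, change_max]. mutate_params is a hypothetical name, and the exact rules in mutation.py may differ; for instance, the config's mutation section assigns each parameter a type, which this sketch ignores.

```python
import random
from typing import Dict


def mutate_params(params: Dict[str, float], mutation_rate: float,
                  change_min: float, change_max: float,
                  rng: random.Random) -> Dict[str, float]:
    """Return a perturbed copy of params; the input dict is left unchanged."""
    mutated = dict(params)
    for name, value in params.items():
        if rng.random() >= mutation_rate:
            continue  # this parameter is not selected for mutation
        factor = rng.uniform(change_min, change_max)
        # Scale up or down with equal probability.
        mutated[name] = value * factor if rng.random() < 0.5 else value / factor
    return mutated
```

Multiplicative perturbations suit scale-like hyperparameters such as learning rate and entropy coefficient, which are usually tuned on a log scale.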

Usage

Use the PBT engine to tune hyperparameters (e.g., learning rate, entropy coefficient, reward scales) automatically during training. It requires launching one training process per policy, typically each on its own GPU. Every process must use the same num_policies and workspace, set pbt.enabled=True, and have a unique pbt.policy_idx in the range [0, num_policies - 1].
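
A small launcher can start the whole population from one script. This sketch assumes train.py is in the current directory and that pinning each process to a GPU through the standard CUDA_VISIBLE_DEVICES environment variable is acceptable; build_launch_commands and launch are hypothetical helpers.

```python
import os
import subprocess
import sys
from typing import Dict, List, Tuple


def build_launch_commands(num_policies: int, num_gpus: int,
                          extra_overrides: List[str]) -> List[Tuple[List[str], Dict[str, str]]]:
    """Return one (argv, env) pair per policy; policies are spread round-robin over GPUs."""
    jobs = []
    for idx in range(num_policies):
        argv = [sys.executable, "train.py", "pbt.enabled=True",
                f"pbt.num_policies={num_policies}", f"pbt.policy_idx={idx}",
                *extra_overrides]
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx % num_gpus))
        jobs.append((argv, env))
    return jobs


def launch(jobs: List[Tuple[List[str], Dict[str, str]]]) -> List[subprocess.Popen]:
    # One OS process per policy; each runs its own training loop and PbtAlgoObserver.
    return [subprocess.Popen(argv, env=env) for argv, env in jobs]
```

For example, launch(build_launch_commands(8, 8, ["pbt.workspace=pbt_ant"])) would start eight policies, one per GPU, all sharing the hypothetical workspace pbt_ant.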

Code Reference

Source Location

isaacgymenvs/pbt/pbt.py (mutation logic in a separate mutation.py module)

Signature

from omegaconf import DictConfig
from rl_games.common.algo_observer import AlgoObserver

class PbtParams:
    def __init__(self, cfg: DictConfig):
        """Parse PBT configuration: replace_fraction_best/worst, mutation_rate,
        change_min/max, interval_steps, start_after, workspace, params_to_mutate."""

class PbtAlgoObserver(AlgoObserver):
    def __init__(self, cfg: DictConfig):
        """Initialize PBT observer with policy index, workspace, and tracking state."""

    def after_init(self, algo):
        """Store reference to RL algorithm and create workspace directories."""

    def process_infos(self, infos, done_indices):
        """Track per-environment objective values from info dict."""

    def after_steps(self):
        """Main PBT logic: checkpoint, load population, rank, replace worst with mutated best."""

    def _save_pbt_checkpoint(self):
        """Save current policy checkpoint and objective value to workspace."""

    def _load_population_checkpoints(self):
        """Load checkpoint metadata from all policies in the population."""

    def _cleanup(self):
        """Remove old checkpoint files from workspace."""

def initial_pbt_check(cfg: DictConfig):
    """Entry point: on first run, mutate initial hyperparameters and restart."""

def _restart_process_with_new_params(policy_idx, new_params, restart_from_checkpoint,
                                      experiment_name, algo, with_wandb):
    """Restart the current process via os.execv with updated hyperparameters."""
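
The restart mechanism can be sketched as rebuilding Hydra-style key=value overrides and handing them to os.execv. Assumptions: the trainer accepts a checkpoint= override for resuming, and a hypothetical +pbt_restart=True marker tells the initial check not to mutate again. build_restart_argv and restart_process are illustrative names, not the signature of _restart_process_with_new_params().

```python
import os
import sys
from typing import Dict, List, Optional

# Hypothetical marker: tells the initial check that this is a PBT restart, so it
# must not mutate the hyperparameters a second time. The leading "+" is Hydra's
# syntax for adding a key that is not already in the config.
RESTART_FLAG = "+pbt_restart=True"


def build_restart_argv(argv: List[str], new_params: Dict[str, object],
                       checkpoint: Optional[str]) -> List[str]:
    """Rebuild Hydra-style CLI args: drop stale values, append the new ones."""
    stale = set(new_params) | {"checkpoint", "pbt_restart"}
    kept = [a for a in argv[1:] if a.split("=", 1)[0].lstrip("+") not in stale]
    new_args = [f"{k}={v}" for k, v in new_params.items()]
    new_args.append(RESTART_FLAG)
    if checkpoint is not None:
        new_args.append(f"checkpoint={checkpoint}")  # resume from the donor's weights
    return [argv[0], *kept, *new_args]


def restart_process(new_argv: List[str]) -> None:
    # Replaces the current process image with a fresh interpreter; on success
    # this call never returns, so flush logs and close files before calling it.
    os.execv(sys.executable, [sys.executable, *new_argv])
```

A typical call would be restart_process(build_restart_argv(sys.argv, mutated, donor_checkpoint_path)). On POSIX systems os.execv keeps the same process ID, so whatever launched the job keeps tracking the same process.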

Import

from isaacgymenvs.pbt.pbt import PbtAlgoObserver, PbtParams, initial_pbt_check

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | Hydra configuration containing the pbt section: enabled, policy_idx, num_policies, workspace, interval_steps, start_after, initial_delay, mutation, replace_fraction_best, replace_fraction_worst, mutation_rate, change_min, change_max |
| infos | dict | Yes | Environment info dict, optionally containing a "true_objective" tensor |
| done_indices | Tensor | Yes | Indices of environments that completed an episode |
| algo | RLAlgo | Yes | Reference to the rl_games algorithm instance (passed via after_init) |
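
The contract for infos and done_indices can be illustrated with a tracker that averages objectives over recently finished episodes. ObjectiveTracker is hypothetical; it uses plain Python sequences in place of torch tensors and takes per-environment episode returns as the fallback when "true_objective" is absent.

```python
from collections import deque
from typing import Deque, Optional, Sequence


class ObjectiveTracker:
    """Hypothetical stand-in for the bookkeeping done in process_infos()."""

    def __init__(self, window: int = 100):
        # Keep only the most recent `window` finished episodes.
        self.values: Deque[float] = deque(maxlen=window)

    def process(self, infos: dict, done_indices: Sequence[int],
                episode_returns: Sequence[float]) -> None:
        objective = infos.get("true_objective")
        # Prefer the environment's custom objective; fall back to episode returns.
        source = objective if objective is not None else episode_returns
        for i in done_indices:
            self.values.append(float(source[i]))

    def mean(self) -> Optional[float]:
        if not self.values:
            return None
        return sum(self.values) / len(self.values)
```

Testing true_objective with `is not None` rather than truthiness matters once it is a tensor, since the truth value of a multi-element torch tensor is ambiguous and raises an error.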

Outputs

| Name | Type | Description |
|---|---|---|
| checkpoint files | YAML + PTH | Per-iteration checkpoint files written to the PBT workspace directory, containing objective values and model weights |
| process restart | side effect | When a policy is replaced, the process restarts itself with mutated hyperparameters via os.execv |
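
The exchange of checkpoint metadata through the workspace can be sketched with one summary file per policy per iteration. This layout is hypothetical: it writes JSON instead of YAML to stay within the standard library, omits the .pth weights, and uses illustrative function names.

```python
import json
import os
import tempfile
from typing import Dict


def save_policy_summary(workspace: str, policy_idx: int, iteration: int,
                        objective: float) -> str:
    policy_dir = os.path.join(workspace, f"{policy_idx:03d}")
    os.makedirs(policy_dir, exist_ok=True)
    path = os.path.join(policy_dir, f"{iteration:06d}.json")  # zero-padded: sorts numerically
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"policy_idx": policy_idx, "iteration": iteration,
                   "objective": objective}, f)
    os.replace(tmp, path)  # atomic rename: readers never see a half-written file
    return path


def load_latest_objectives(workspace: str) -> Dict[int, float]:
    """Map each policy index to the objective in its most recent summary."""
    result: Dict[int, float] = {}
    for name in sorted(os.listdir(workspace)):
        policy_dir = os.path.join(workspace, name)
        if not os.path.isdir(policy_dir):
            continue
        files = sorted(f for f in os.listdir(policy_dir) if f.endswith(".json"))
        if not files:
            continue
        with open(os.path.join(policy_dir, files[-1])) as f:
            data = json.load(f)
        result[data["policy_idx"]] = data["objective"]
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ws:
        save_policy_summary(ws, 0, 1, 1.5)
        save_policy_summary(ws, 1, 1, 0.5)
        print(load_latest_objectives(ws))  # {0: 1.5, 1: 0.5}
```

Writing to a temporary file and renaming it with os.replace matters here because every policy scans the others' directories while those processes may still be writing.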

Usage Examples

# Enable PBT in your Hydra config:
# pbt:
#   enabled: True
#   policy_idx: 0  # unique per process, in [0, num_policies - 1]; override on the CLI
#   num_policies: 8
#   workspace: "pbt_workspace"
#   interval_steps: 10000000
#   start_after: 10000000
#   initial_delay: 0
#   replace_fraction_best: 0.3
#   replace_fraction_worst: 0.2
#   mutation_rate: 0.8
#   change_min: 0.8
#   change_max: 1.2
#   mutation:
#     train.params.config.learning_rate: "mutate_float"
#     train.params.config.entropy_coef: "mutate_float"

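# In train.py, the initial check runs first and the observer is handed to the rl_games Runner:
from rl_games.torch_runner import Runner
from isaacgymenvs.pbt.pbt import PbtAlgoObserver, initial_pbt_check
from isaacgymenvs.utils.rlgames_utils import MultiObserver, RLGPUAlgoObserver

if cfg.pbt.enabled:
    # On a fresh (non-restarted) run, this mutates the initial hyperparameters and restarts
    initial_pbt_check(cfg)

observers = [RLGPUAlgoObserver()]
if cfg.pbt.enabled:
    observers.append(PbtAlgoObserver(cfg))

# Runner takes a single observer; MultiObserver forwards every callback to each one
runner = Runner(MultiObserver(observers))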

# Launch multiple policies:
# python train.py pbt.policy_idx=0 &
# python train.py pbt.policy_idx=1 &
# ...
# python train.py pbt.policy_idx=7 &
