Implementation:Isaac sim IsaacGymEnvs PBT Engine

Knowledge Sources	IsaacGymEnvs
Domains	Hyperparameter_Optimization, Evolutionary_Computing
Last Updated	2026-02-15 11:00 GMT

Overview

The PBT (Population-Based Training) engine implements evolutionary hyperparameter optimization by running a population of RL policies in parallel, periodically replacing the worst performers with mutated copies of the best.

Description

The PBT system in isaacgymenvs/pbt/pbt.py consists of three main components: PbtParams (a configuration container), PbtAlgoObserver (the core PBT logic as an rl_games AlgoObserver), and the initial_pbt_check() entry point function. The system operates by running multiple independent training processes (one per policy), each with its own hyperparameters. At regular intervals defined by interval_steps, each process checkpoints its current state and evaluates the population.
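The scheduling described above can be sketched as a small predicate. This is only an illustration: the function name pbt_iteration_due and the way start_after gates the first iteration are assumptions, not the code in pbt.py.

```python
def pbt_iteration_due(total_steps: int, last_pbt_steps: int,
                      interval_steps: int, start_after: int) -> bool:
    """Return True when a new PBT iteration should run (hypothetical helper)."""
    if total_steps < start_after:
        # Too early: let every policy train for a while before comparing them.
        return False
    # Run again once at least interval_steps have elapsed since the last iteration.
    return total_steps - last_pbt_steps >= interval_steps
```

A caller would record the step count at which the last iteration ran and pass it back in as last_pbt_steps on the next check.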

PbtAlgoObserver hooks into the rl_games training loop via the AlgoObserver interface. Its process_infos() method tracks per-environment objective values, taken from a custom "true_objective" entry in the environment's info dict when one is present and from the default mean reward otherwise. The after_steps() method implements the core PBT logic: it saves a checkpoint of the current policy, loads checkpoint data from all other policies in the population, ranks them by objective value, and decides whether the current policy should be replaced. If the current policy falls in the worst-performing fraction of the population (replace_fraction_worst), it is replaced by a mutated copy of a policy from the best-performing fraction (replace_fraction_best), and the process restarts itself via os.execv() with the new hyperparameters.
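The rank-and-replace decision can be sketched roughly as follows. The helper names, the rounding rule (ceil with a minimum of one policy per group), and the uniform random choice among the best policies are assumptions made for illustration; the actual logic in after_steps() may differ.

```python
import math
import random
from typing import Dict, List, Optional, Tuple


def split_population(objectives: Dict[int, float], frac_best: float,
                     frac_worst: float) -> Tuple[List[int], List[int]]:
    """Rank policy indices by objective and return (best, worst) groups."""
    ranked = sorted(objectives, key=objectives.get, reverse=True)  # best first
    n = len(ranked)
    n_best = max(1, math.ceil(n * frac_best))    # assumed rounding rule
    n_worst = max(1, math.ceil(n * frac_worst))  # assumed rounding rule
    return ranked[:n_best], ranked[-n_worst:]


def choose_replacement(policy_idx: int, objectives: Dict[int, float],
                       frac_best: float, frac_worst: float,
                       rng: random.Random) -> Optional[int]:
    """Return the index of the policy to copy from, or None to keep training."""
    best, worst = split_population(objectives, frac_best, frac_worst)
    if policy_idx not in worst:
        return None  # this policy is doing well enough; leave it alone
    candidates = [p for p in best if p != policy_idx]
    return rng.choice(candidates) if candidates else None
```

With eight policies, replace_fraction_best=0.3 and replace_fraction_worst=0.2, this sketch treats the top three policies as donors and the bottom two as candidates for replacement.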

The mutation logic (in a separate mutation.py module) applies random perturbations to the hyperparameters listed under the pbt.mutation config section, within the bounds set by change_min and change_max. The helper function _restart_process_with_new_params() reconstructs the command-line arguments with the updated parameters and re-executes the Python process. Because all of this happens outside the training loop itself, PBT works with any IsaacGymEnvs task without modifying the underlying RL algorithm.
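A minimal sketch of a bounded multiplicative mutation is shown below. It assumes that each float parameter is mutated with probability mutation_rate and scaled by a factor drawn uniformly from [change_min, change_max]; the function names are hypothetical and the real mutation.py may use a different perturbation scheme.

```python
import random
from typing import Dict


def mutate_float(value: float, change_min: float, change_max: float,
                 rng: random.Random) -> float:
    """Scale value by a random factor in [change_min, change_max] (assumed scheme)."""
    return value * rng.uniform(change_min, change_max)


def mutate_params(params: Dict[str, float], mutation_rate: float,
                  change_min: float, change_max: float,
                  rng: random.Random) -> Dict[str, float]:
    """Return a mutated copy of params; the input dict is left untouched."""
    mutated = dict(params)
    for name, value in params.items():
        # Each parameter is perturbed independently with probability mutation_rate.
        if rng.random() < mutation_rate:
            mutated[name] = mutate_float(value, change_min, change_max, rng)
    return mutated
```

For example, with change_min=0.8 and change_max=1.2, a learning rate of 1e-3 would end up somewhere between 8e-4 and 1.2e-3 whenever it is selected for mutation.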

Usage

Use the PBT engine when you want to automatically tune hyperparameters (e.g., learning rate, entropy coefficient, reward scales) during training. It requires launching multiple training processes, typically one per GPU or one per policy index, each configured with pbt.enabled=True and a unique pbt.policy_idx.
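One way to generate the per-policy launch commands is sketched below. The helper build_pbt_commands is hypothetical, and the script name train.py and the override keys are taken from the examples on this page rather than from a verified launcher.

```python
import sys
from typing import List


def build_pbt_commands(num_policies: int,
                       extra_overrides: List[str]) -> List[List[str]]:
    """Build one train.py command line per policy index (illustrative only)."""
    commands = []
    for idx in range(num_policies):
        commands.append([
            sys.executable, "train.py",
            "pbt.enabled=True",
            f"pbt.policy_idx={idx}",          # unique index per process
            f"pbt.num_policies={num_policies}",
            *extra_overrides,                 # e.g. task selection, passed through as-is
        ])
    return commands
```

Each command could then be started with subprocess.Popen; how processes are pinned to GPUs is left to the launcher and is not shown here.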

Code Reference

Source Location

Repository: IsaacGymEnvs
File: isaacgymenvs/pbt/pbt.py
Lines: 1-692

Signature

class PbtParams:
    def __init__(self, cfg: DictConfig):
        """Parse PBT configuration: replace_fraction_best/worst, mutation_rate,
        change_min/max, interval_steps, start_after, workspace, params_to_mutate."""

class PbtAlgoObserver(AlgoObserver):
    def __init__(self, cfg: DictConfig):
        """Initialize PBT observer with policy index, workspace, and tracking state."""

    def after_init(self, algo):
        """Store reference to RL algorithm and create workspace directories."""

    def process_infos(self, infos, done_indices):
        """Track per-environment objective values from info dict."""

    def after_steps(self):
        """Main PBT logic: checkpoint, load population, rank, replace worst with mutated best."""

    def _save_pbt_checkpoint(self):
        """Save current policy checkpoint and objective value to workspace."""

    def _load_population_checkpoints(self):
        """Load checkpoint metadata from all policies in the population."""

    def _cleanup(self):
        """Remove old checkpoint files from workspace."""

def initial_pbt_check(cfg: DictConfig):
    """Entry point: on first run, mutate initial hyperparameters and restart."""

def _restart_process_with_new_params(policy_idx, new_params, restart_from_checkpoint,
                                      experiment_name, algo, with_wandb):
    """Restart the current process via os.execv with updated hyperparameters."""

Import

from isaacgymenvs.pbt.pbt import PbtAlgoObserver, PbtParams, initial_pbt_check

I/O Contract

Inputs

Name	Type	Required	Description
cfg	DictConfig	Yes	Hydra configuration containing the pbt section with enabled, policy_idx, num_policies, workspace, interval_steps, start_after, initial_delay, mutation, replace_fraction_best, replace_fraction_worst, mutation_rate, change_min, change_max
infos	dict	Yes	Environment info dict, optionally containing "true_objective" tensor
done_indices	Tensor	Yes	Indices of environments that completed an episode
algo	RLAlgo	Yes	Reference to the rl_games algorithm instance (passed via after_init)
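The fallback between "true_objective" and the reward signal can be sketched as follows. Plain Python lists stand in for tensors, and the function names and the rewards argument are assumptions for illustration; process_infos() itself takes only infos and done_indices.

```python
from typing import Dict, List, Optional, Sequence


def episode_objectives(infos: Dict[str, Sequence[float]],
                       rewards: Sequence[float],
                       done_indices: Sequence[int]) -> List[float]:
    """Collect the objective of every environment that just finished an episode."""
    # Prefer the environment-provided objective; otherwise fall back to rewards.
    source = infos.get("true_objective", rewards)
    return [float(source[i]) for i in done_indices]


def mean_objective(values: Sequence[float]) -> Optional[float]:
    """Average the collected objectives, or None if no episode has finished yet."""
    return sum(values) / len(values) if values else None
```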

Outputs

Name	Type	Description
checkpoint files	YAML + PTH	Per-iteration checkpoint files written to the PBT workspace directory containing objective values and model weights
process restart	side effect	When a policy is replaced, the process restarts itself with mutated hyperparameters via os.execv
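The restart side effect can be illustrated with a small sketch that rewrites key=value overrides and then re-executes the interpreter. The helper names are hypothetical and the handling of overrides is simplified; _restart_process_with_new_params() in pbt.py takes additional arguments and may build the command line differently.

```python
import os
import sys
from typing import Dict, List


def build_restart_argv(argv: List[str], new_params: Dict[str, object]) -> List[str]:
    """Replace existing key=value overrides with the new parameter values."""
    # Drop any override whose key is about to be re-specified.
    kept = [arg for arg in argv if arg.split("=", 1)[0] not in new_params]
    return kept + [f"{key}={value}" for key, value in new_params.items()]


def restart_with(new_params: Dict[str, object]) -> None:
    """Replace the current process with a fresh one using the updated overrides."""
    argv = build_restart_argv(sys.argv, new_params)
    # os.execv never returns on success: the current process image is replaced.
    os.execv(sys.executable, [sys.executable] + argv)
```

Because os.execv replaces the process in place, the process ID is preserved, but any state that was not written to a checkpoint beforehand is lost.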

Usage Examples

# Enable PBT in your Hydra config:
# pbt:
#   enabled: True
#   policy_idx: ${pbt_policy_idx:0}
#   num_policies: 8
#   workspace: "pbt_workspace"
#   interval_steps: 10000000
#   start_after: 10000000
#   initial_delay: 0
#   replace_fraction_best: 0.3
#   replace_fraction_worst: 0.2
#   mutation_rate: 0.8
#   change_min: 0.8
#   change_max: 1.2
#   mutation:
#     train.params.config.learning_rate: float
#     train.params.config.entropy_coef: float

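# In train.py, the observer is created and attached.
# `cfg` (the Hydra DictConfig) and `runner` are assumed to exist already in train.py.
from isaacgymenvs.pbt.pbt import PbtAlgoObserver, initial_pbt_check

if cfg.pbt.enabled:
    # On the first run this mutates the initial hyperparameters and restarts the process.
    initial_pbt_check(cfg)
    pbt_observer = PbtAlgoObserver(cfg)
    runner.algo_observer = pbt_observer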

# Launch multiple policies:
# python train.py pbt.policy_idx=0 &
# python train.py pbt.policy_idx=1 &
# ...
# python train.py pbt.policy_idx=7 &

Related Pages

Principle:Isaac_sim_IsaacGymEnvs_Population_Based_Training
