Workflow:Farama Foundation Gymnasium Vectorized Environment Training

Knowledge Sources
Domains Reinforcement_Learning, Parallel_Computing, Deep_RL
Last Updated 2026-02-15 03:00 GMT

Overview

End-to-end process for training a deep reinforcement learning agent using vectorized (parallel) environments with the Advantage Actor-Critic (A2C) algorithm.

Description

This workflow demonstrates how to accelerate RL training by running multiple environment instances in parallel using Gymnasium's vectorized environment infrastructure. It implements A2C (the synchronous variant of A3C) with separate actor and critic neural networks, Generalized Advantage Estimation (GAE) for computing advantages, and batched data collection across multiple environments. The workflow covers creating vectorized environments with gym.make_vec or SyncVectorEnv/AsyncVectorEnv, feeding batched states through neural networks, computing GAE-based losses, and optional domain randomization for more robust training. The primary example trains on LunarLander-v3 with 10 parallel environments.

Usage

Execute this workflow when you need to train a deep RL agent faster by collecting experience from multiple environments simultaneously. This is appropriate when single-environment training is too slow, when you want to reduce gradient variance through parallel sampling, or when you want to apply domain randomization across environment instances for more robust policies. Use vectorized environments when your training algorithm can consume batched transitions (A2C, PPO, and most on-policy methods).

Execution Steps

Step 1: Vectorized Environment Creation

Create a batch of parallel environments using one of three approaches: gym.make_vec for identical environments with a specified num_envs count, SyncVectorEnv for serial execution with custom per-environment configuration, or AsyncVectorEnv for true multiprocess parallelism. Configure environment parameters, max_episode_steps, and optionally apply domain randomization by varying parameters (gravity, wind, etc.) across instances.

Key considerations:

  • gym.make_vec is the simplest approach for identical environments
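  • SyncVectorEnv steps the environments one after another in a single process (simpler, and often fast enough when each step is cheap)
  • AsyncVectorEnv runs each environment in its own subprocess for real parallelism (pays off when a single step is expensive, at the cost of inter-process overhead and added complexity)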
  • Access single_observation_space and single_action_space for individual environment specs
  • Domain randomization uses different parameter values per environment for robustness

Step 2: Actor-Critic Network Definition

Define separate actor and critic neural networks using PyTorch. The critic network estimates state values V(s) and outputs a single scalar per state. The actor network outputs action logits for a categorical distribution over discrete actions (or mean/std for continuous actions). Both networks process batched inputs where the batch dimension corresponds to the number of parallel environments.

Key considerations:

  • The critic is often given a larger learning rate than the actor so that its value estimates keep pace with the changing policy
  • Use separate optimizers for actor and critic networks
  • Networks must handle batched input of shape [n_envs, obs_dims]
  • For discrete actions, use Categorical distribution; for continuous, use Normal distribution
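
A minimal sketch of the two networks for the discrete-action case. The hidden width, the RMSprop optimizer, and the learning rates are assumptions made for illustration; the observation and action sizes match LunarLander-v3 (8 observation dimensions, 4 discrete actions).

```python
import torch
from torch import nn

n_envs, obs_dims, n_actions = 10, 8, 4

# Critic: one scalar value estimate V(s) per state.
critic = nn.Sequential(nn.Linear(obs_dims, 32), nn.ReLU(), nn.Linear(32, 1))
# Actor: one logit per discrete action.
actor = nn.Sequential(nn.Linear(obs_dims, 32), nn.ReLU(), nn.Linear(32, n_actions))

# Separate optimizers; the critic learning rate is the larger one (assumed values).
critic_optim = torch.optim.RMSprop(critic.parameters(), lr=5e-3)
actor_optim = torch.optim.RMSprop(actor.parameters(), lr=1e-3)

# Random tensor standing in for a batch of observations, shape [n_envs, obs_dims].
obs = torch.randn(n_envs, obs_dims)
values = critic(obs).squeeze(-1)                      # [n_envs]
dist = torch.distributions.Categorical(logits=actor(obs))
actions = dist.sample()                               # [n_envs]
log_probs = dist.log_prob(actions)                    # [n_envs]
entropy = dist.entropy()                              # [n_envs]
```

For continuous actions, the actor would instead output a mean and standard deviation and build a torch.distributions.Normal from them.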

Step 3: Batched Data Collection

Collect transitions across all parallel environments for a fixed number of steps per update phase. At each timestep, pass the batched observations through the actor-critic networks to get actions, log-probabilities, state value estimates, and policy entropy for all environments simultaneously. Execute the actions in the vectorized environment and store the resulting rewards and termination masks. Note that vectorized environments auto-reset upon episode completion.

Key considerations:

  • Collect n_steps_per_update steps across n_envs environments per sampling phase
  • Total transitions per update = n_steps_per_update * n_envs
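  • Vectorized environments auto-reset finished sub-environments, so no manual env.reset() is needed during collection; whether the reset happens on the terminating step or on the following step depends on the Gymnasium version and autoreset mode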
  • Store termination masks (1 for ongoing, 0 for terminated) for return computation
  • Wrap with RecordEpisodeStatistics for tracking per-episode performance

Step 4: GAE Loss Computation and Network Update

Compute Generalized Advantage Estimation (GAE) by iterating backwards through the collected timesteps and accumulating TD errors, each discounted by the product of gamma and lambda. Calculate the critic loss as the mean of the squared advantages. Calculate the actor loss as the negative mean of the advantages (treated as constants, with no gradient flowing through them) times the log-probabilities, minus an entropy bonus weighted by ent_coef to encourage exploration. Update both networks using their respective optimizers.

Key considerations:

  • GAE with lambda=1 reduces to Monte Carlo returns minus the value baseline (high variance, no bias from the value estimate)
  • GAE with lambda=0 reduces to the one-step TD error (low variance, biased by the value estimate)
  • The entropy bonus encourages exploration and prevents premature policy collapse
  • Clip gradients if training becomes unstable
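
One way to write the backward GAE recursion and the two losses is sketched here. The function name compute_gae, the explicit last_value bootstrap argument, and the hyperparameter values are assumptions for illustration; random tensors stand in for a collected rollout.

```python
import torch

def compute_gae(rewards, values, masks, last_value, gamma=0.999, lam=0.95):
    """rewards, values, masks: [T, n_envs]; last_value: [n_envs] bootstrap value."""
    advantages = []
    gae = torch.zeros_like(last_value)
    next_value = last_value
    for t in reversed(range(rewards.shape[0])):
        # One-step TD error; the mask removes the bootstrap term after termination.
        td_error = rewards[t] + gamma * masks[t] * next_value - values[t]
        gae = td_error + gamma * lam * masks[t] * gae
        advantages.append(gae)
        next_value = values[t]
    return torch.stack(advantages[::-1])  # back to chronological order

torch.manual_seed(0)
T, n_envs, ent_coef = 5, 3, 0.01
rewards, masks = torch.randn(T, n_envs), torch.ones(T, n_envs)
values = torch.randn(T, n_envs, requires_grad=True)      # critic outputs
log_probs = torch.randn(T, n_envs, requires_grad=True)   # actor log-probabilities
entropy = torch.rand(T, n_envs)
advantages = compute_gae(rewards, values, masks, torch.zeros(n_envs))

critic_loss = advantages.pow(2).mean()
# detach(): the actor treats advantages as constants.
actor_loss = -(advantages.detach() * log_probs).mean() - ent_coef * entropy.mean()
critic_loss.backward()
actor_loss.backward()
```

If training becomes unstable, torch.nn.utils.clip_grad_norm_ can be applied to each network's parameters between backward() and the optimizer step.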

Step 5: Training Visualization and Model Persistence

Plot training curves including episode returns (moving average), critic loss, actor loss, and policy entropy over the course of training. Optionally save the trained actor and critic network weights using torch.save for later loading and evaluation. Run showcase episodes with the trained agent to visually verify learned behavior.

Key considerations:

  • Use moving averages (e.g., window of 20) to smooth noisy training curves
  • Decreasing entropy indicates the policy is becoming more deterministic
  • Save both actor and critic weights for complete model persistence
  • Load saved weights with torch.load, pass them to each network's load_state_dict, and set the networks to eval mode for inference
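
A short sketch of the smoothing and persistence steps. The helper name moving_average, the file name, the synthetic returns, and the single linear layer standing in for the trained actor are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import torch
from torch import nn

def moving_average(x, window=20):
    """Mean over a sliding window; the output has len(x) - window + 1 points."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

episode_returns = np.arange(50, dtype=np.float64)  # synthetic stand-in data
smoothed = moving_average(episode_returns)

actor = nn.Linear(8, 4)  # placeholder for the trained actor network
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "actor_weights.pt")   # hypothetical file name
    torch.save(actor.state_dict(), path)           # save the weights only
    restored = nn.Linear(8, 4)                     # same architecture as saved
    restored.load_state_dict(torch.load(path))
restored.eval()  # inference mode for showcase episodes
```

The critic would be saved and restored the same way, with its own file.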
