Principle:LaurentMazare Tch rs Vectorized Environment
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, Parallel Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Vectorized environments run multiple independent environment instances in parallel, batching observations and actions to improve throughput for reinforcement learning training.
Description
A vectorized environment wraps multiple instances of the same environment to execute them simultaneously, providing a batch interface rather than a single-instance interface. This is a critical optimization for modern reinforcement learning:
- Parallel execution: Instead of stepping one environment at a time, all environments are stepped with a batch of actions in a single call. This enables efficient use of hardware parallelism (CPU multiprocessing or GPU batched computation) and dramatically increases sample throughput.
- Batched observations: Observations from all environments are stacked into a single tensor with an additional batch dimension. If each individual observation has shape $(d_1, \dots, d_k)$ and there are $N$ environments, the batched observation has shape $(N, d_1, \dots, d_k)$. This allows the policy network to process all observations in a single forward pass, maximizing GPU utilization.
- Automatic reset on episode termination: When an individual environment's episode ends (done flag is true), it is automatically reset to a new initial state. The observation returned for that environment slot is the new initial observation from the reset, not the terminal observation. This removes the need for the training loop to call reset manually, although the done flags are still needed to cut returns at episode boundaries.
- Independent episodes: Each environment instance runs its own independent episode. Episodes start and end at different times across the batch, providing a continuous stream of experience with diverse states.
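The stacking described in the list above can be sketched in plain Rust. This is a minimal illustration, not the tch-rs implementation: a flat `Vec<f32>` stands in for a tensor, observations are assumed to be one-dimensional, and the function name `stack_observations` is hypothetical.

```rust
/// Stack `n` observations of equal length into one row-major buffer of
/// shape (n, obs_dim). Returns the buffer together with its shape.
/// Hypothetical helper: a real pipeline would build a tensor instead.
fn stack_observations(observations: &[Vec<f32>]) -> (Vec<f32>, [usize; 2]) {
    let n = observations.len();
    let obs_dim = observations.first().map_or(0, |o| o.len());
    let mut flat = Vec::with_capacity(n * obs_dim);
    for obs in observations {
        // Every environment must report the same observation shape.
        assert_eq!(obs.len(), obs_dim, "observation shapes must match");
        flat.extend_from_slice(obs);
    }
    (flat, [n, obs_dim])
}
```

Row `i` of the buffer holds the observation of environment `i`, so a policy can consume all rows in one batched forward pass.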
Usage
Vectorized environments are widely used in modern RL training pipelines, especially with on-policy algorithms (A2C, PPO) that benefit from large batches of experience. They matter most for wall-clock training time when simulation is fast but the number of samples required is high.
Theoretical Basis
Vectorized Interface:
Given $N$ parallel environments $E_1, \dots, E_N$:
    RESET():
        for i = 1 to N:
            obs_i := E_i.reset()
        return stack(obs_1, ..., obs_N)    // shape: (N, *obs_shape)

    STEP(actions):    // actions shape: (N, *action_shape)
        for i = 1 to N:
            obs_i, reward_i, done_i, info_i := E_i.step(actions[i])
            if done_i:
                obs_i := E_i.reset()    // auto-reset
        return stack(obs), stack(rewards), stack(dones), infos
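The pseudocode above can be turned into a small runnable Rust sketch. Everything here is an assumption made for illustration: `CounterEnv` is a hypothetical toy environment with scalar observations, `VecEnv` is a hypothetical wrapper name, and the environments are stepped one after another rather than truly in parallel. It is not the tch-rs API.

```rust
/// Hypothetical toy environment: a counter whose episode ends after
/// `horizon` steps. It stands in for any real single-instance environment.
struct CounterEnv {
    t: u32,
    horizon: u32,
}

impl CounterEnv {
    fn reset(&mut self) -> f32 {
        self.t = 0;
        self.t as f32
    }

    /// Returns (observation, reward, done). The reward simply echoes the action.
    fn step(&mut self, action: f32) -> (f32, f32, bool) {
        self.t += 1;
        (self.t as f32, action, self.t >= self.horizon)
    }
}

/// Batch wrapper over several environments, mirroring RESET and STEP.
struct VecEnv {
    envs: Vec<CounterEnv>,
}

impl VecEnv {
    fn new(horizons: &[u32]) -> Self {
        let envs = horizons
            .iter()
            .map(|&h| CounterEnv { t: 0, horizon: h })
            .collect();
        VecEnv { envs }
    }

    fn reset(&mut self) -> Vec<f32> {
        self.envs.iter_mut().map(|e| e.reset()).collect()
    }

    /// Steps every environment once. A finished environment is reset
    /// immediately, so its slot holds the new initial observation.
    fn step(&mut self, actions: &[f32]) -> (Vec<f32>, Vec<f32>, Vec<bool>) {
        assert_eq!(actions.len(), self.envs.len(), "one action per environment");
        let mut obs = Vec::with_capacity(actions.len());
        let mut rewards = Vec::with_capacity(actions.len());
        let mut dones = Vec::with_capacity(actions.len());
        for (env, &action) in self.envs.iter_mut().zip(actions) {
            let (mut o, r, done) = env.step(action);
            if done {
                o = env.reset(); // auto-reset
            }
            obs.push(o);
            rewards.push(r);
            dones.push(done);
        }
        (obs, rewards, dones)
    }
}
```

With horizons of 2 and 3, the second call to `step` reports `done = true` for the first slot only, and that slot's observation is already the reset value `0.0` while the second slot continues its episode.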
Throughput Analysis:
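For a policy network with single-observation inference time $t_{\text{inf}}$ and environment step time $t_{\text{env}}$:
- Sequential: Total time for $N$ transitions = $N \cdot (t_{\text{env}} + t_{\text{inf}})$
- Vectorized ($N$ parallel envs, batched inference): Total time per step (also $N$ transitions) = $t_{\text{env}} + t_{\text{inf}}(N)$
where $t_{\text{inf}}(N)$ is the time for a single batched forward pass with batch size $N$. The vectorized figure assumes the environments are stepped concurrently; if they are stepped one after another, the environment term becomes $N \cdot t_{\text{env}}$. Since $t_{\text{inf}}(N) \ll N \cdot t_{\text{inf}}$ on GPUs, the speedup can be substantial.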
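A small worked calculation may make the comparison concrete. The sketch below assumes a sequential cost of `n * (t_env + t_inf)` for `n` transitions and a vectorized cost of `t_env + t_inf_batch` for the same `n` transitions, with environments stepped concurrently. The function names and all timings are hypothetical and chosen only for illustration; they are not measurements.

```rust
/// Time to collect `n` transitions with one environment and one
/// forward pass per observation.
fn sequential_time(n: u32, t_env: f64, t_inf: f64) -> f64 {
    n as f64 * (t_env + t_inf)
}

/// Time to collect `n` transitions with `n` environments stepped
/// concurrently and a single batched forward pass costing `t_inf_batch`.
fn vectorized_time(t_env: f64, t_inf_batch: f64) -> f64 {
    t_env + t_inf_batch
}

/// Ratio of the two costs under the assumptions stated above.
fn speedup(n: u32, t_env: f64, t_inf: f64, t_inf_batch: f64) -> f64 {
    sequential_time(n, t_env, t_inf) / vectorized_time(t_env, t_inf_batch)
}
```

For example, with 16 environments, an assumed 1 ms environment step, 2 ms single-observation inference, and 3 ms batched inference, the sequential cost is 48 ms and the vectorized cost is 4 ms.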
Batch Size and Variance:
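With $N$ parallel environments each contributing $T$ steps, the effective batch size is $B = N \cdot T$. Treating the samples as approximately independent, the variance of the policy gradient estimate scales as:
$$\operatorname{Var}[\hat{g}] \propto \frac{1}{N \cdot T}$$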
Larger batches from more parallel environments reduce gradient variance, enabling larger learning rates and more stable training.
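The inverse scaling with batch size can be derived in two lines under a simplifying assumption: the per-sample gradient estimates $g_{i,t}$ (environment $i$, step $t$) are independent with a common variance $\sigma^2$. Here $N$ is the number of environments and $T$ the number of steps each contributes; the symbols $g_{i,t}$ and $\sigma^2$ are introduced only for this sketch.

```latex
\begin{align}
\hat{g} &= \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} g_{i,t}, \\
\operatorname{Var}[\hat{g}]
  &= \frac{1}{(N T)^2} \sum_{i=1}^{N} \sum_{t=1}^{T} \operatorname{Var}[g_{i,t}]
   = \frac{\sigma^2}{N T}.
\end{align}
```

For instance, with $\sigma^2 = 4$, $N = 8$, and $T = 32$, the variance of the averaged estimate is $4 / 256 = 1/64$. In practice, consecutive steps within one environment are correlated, so the reduction from increasing $T$ is usually weaker than this bound suggests, while samples from different environments are closer to independent.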
Auto-Reset Semantics:
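The auto-reset mechanism ensures that the observation tensor always contains valid initial or mid-episode observations. The cost is that the terminal observation $s_{\text{terminal}}$ of a finished episode is replaced by the reset observation in the returned batch. For correct advantage computation, the terminal observation (before reset) must be preserved separately so that the bootstrapped value is computed from it rather than from the new initial observation, for example in a one-step target $r_t + \gamma V(s_{\text{terminal}})$.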
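One way to make this concrete is a one-step value target that chooses what to bootstrap from. The sketch below rests on assumptions beyond the text: it distinguishes an episode that was cut off (for example by a time limit), where bootstrapping from the preserved terminal observation is appropriate, from one that truly terminated, where the future value is taken as zero. The names `StepEnd` and `one_step_target` are hypothetical, and the value estimates are passed in as plain numbers rather than computed by a network.

```rust
/// How an environment slot finished the current step (hypothetical type).
#[derive(Clone, Copy)]
enum StepEnd {
    /// Episode continues; `next_value` is V of the returned observation.
    Running { next_value: f32 },
    /// Episode was cut off; `terminal_value` is V of the preserved
    /// pre-reset observation, not of the reset observation.
    Truncated { terminal_value: f32 },
    /// Episode truly ended; there is nothing to bootstrap from.
    Terminated,
}

/// One-step target r + gamma * V(next), using the correct "next" state.
fn one_step_target(reward: f32, gamma: f32, end: StepEnd) -> f32 {
    match end {
        StepEnd::Running { next_value } => reward + gamma * next_value,
        StepEnd::Truncated { terminal_value } => reward + gamma * terminal_value,
        StepEnd::Terminated => reward,
    }
}
```

Using the value of the reset observation in the `Truncated` case would mix the value of a fresh episode into the return of the finished one, which is the error that preserving the terminal observation avoids.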