Principle:LaurentMazare Tch rs Vectorized Environment

Knowledge Sources	LaurentMazare_Tch_rs OpenAI Gym
Domains	Reinforcement Learning, Parallel Computing
Last Updated	2026-02-08 00:00 GMT

Overview

Vectorized environments run multiple independent environment instances in parallel, batching observations and actions to improve throughput for reinforcement learning training.

Description

A vectorized environment wraps multiple instances of the same environment to execute them simultaneously, providing a batch interface rather than a single-instance interface. This is a critical optimization for modern reinforcement learning:

Parallel execution: Instead of stepping one environment at a time, all $N$ environments are stepped with a batch of $N$ actions in a single call. This enables efficient use of hardware parallelism (CPU multiprocessing or GPU batched computation) and can increase sample throughput substantially.

Batched observations: Observations from all environments are stacked into a single tensor with an additional batch dimension. If each individual observation has shape $(d_{1}, d_{2}, \dots)$, the batched observation has shape $(N, d_{1}, d_{2}, \dots)$. This allows the policy network to process all observations in a single forward pass, maximizing GPU utilization.

Automatic reset on episode termination: When an individual environment's episode ends (its done flag is true), that environment is automatically reset to a new initial state. The observation returned for that environment slot is the initial observation of the new episode, not the terminal observation of the old one. The training loop therefore never has to reset individual environments itself, although it still needs the done flags to separate episodes when computing returns.

Independent episodes: Each environment instance runs its own independent episode. Episodes start and end at different times across the batch, providing a continuous stream of experience with diverse states.
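
The batched observation shape described above can be illustrated with a minimal sketch. It uses plain nested lists in place of tensors, and the helpers `stack` and `shape` are hypothetical names introduced only for this example, not part of any library.

```python
from typing import List

Observation = List[List[float]]  # one observation of shape (d1, d2)


def shape(nested) -> tuple:
    """Return the shape of a rectangular nested list, e.g. (N, d1, d2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


def stack(observations: List[Observation]) -> List[Observation]:
    """Stack N same-shaped observations along a new leading batch axis."""
    first_shape = shape(observations[0])
    for obs in observations:
        if shape(obs) != first_shape:
            raise ValueError("all observations must share one shape")
    return list(observations)


N = 4
single_obs = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # shape (2, 3)
batch = stack([[row[:] for row in single_obs] for _ in range(N)])

print(shape(single_obs))  # (2, 3)
print(shape(batch))       # (4, 2, 3)
```

With a tensor library the same step is usually a single stacking call along a new leading dimension; the nested lists here only make the shape bookkeeping visible.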

Usage

Vectorized environments are used in virtually all modern RL training pipelines, especially with on-policy algorithms (A2C, PPO) that benefit from large batches of experience. They are essential for achieving competitive wall-clock training times in environments with fast simulation but high sample requirements.

Theoretical Basis

Vectorized Interface:

Given $N$ parallel environments $E_{1}, E_{2}, \dots, E_{N}$:

RESET():
    for i = 1 to N:
        obs_i := E_i.reset()
    return stack(obs_1, ..., obs_N)    // shape: (N, *obs_shape)

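STEP(actions):     // actions shape: (N, *action_shape)
    for i = 1 to N:
        obs_i, reward_i, done_i, info_i := E_i.step(actions[i])
        if done_i:
            info_i.terminal_obs := obs_i   // keep the pre-reset observation
            obs_i := E_i.reset()           // auto-reset
    return stack(obs_1, ..., obs_N), stack(reward_1, ..., reward_N), stack(done_1, ..., done_N), (info_1, ..., info_N)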
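
The interface above could be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used by any particular library: `CountdownEnv` is a hypothetical toy environment, the environments are stepped in a plain loop rather than in parallel, and storing the pre-reset observation under an `info["terminal_obs"]` key is an assumption made for this example.

```python
from typing import Any, Dict, List, Tuple


class CountdownEnv:
    """Hypothetical toy environment: the observation counts down to zero."""

    def __init__(self, start: int) -> None:
        self.start = start
        self.remaining = start

    def reset(self) -> List[float]:
        self.remaining = self.start
        return [float(self.remaining)]

    def step(self, action: int) -> Tuple[List[float], float, bool, Dict[str, Any]]:
        # The action is ignored; every step simply decrements the counter.
        self.remaining -= 1
        done = self.remaining <= 0
        return [float(self.remaining)], 1.0, done, {}


class VecEnv:
    """Steps N environments in lockstep and auto-resets finished ones."""

    def __init__(self, envs: List[CountdownEnv]) -> None:
        self.envs = envs

    def reset(self) -> List[List[float]]:
        return [env.reset() for env in self.envs]  # shape: (N, *obs_shape)

    def step(self, actions: List[int]):
        if len(actions) != len(self.envs):
            raise ValueError("need exactly one action per environment")
        obs_batch, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # Assumed convention: keep the pre-reset observation in info.
                info = dict(info, terminal_obs=obs)
                obs = env.reset()  # auto-reset
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs_batch, rewards, dones, infos


vec = VecEnv([CountdownEnv(start=2), CountdownEnv(start=3)])
first_obs = vec.reset()
step1 = vec.step([0, 0])
step2 = vec.step([0, 0])
```

In this run the first environment finishes on the second step, so its slot in `step2` holds the fresh initial observation `[2.0]` while `infos[0]["terminal_obs"]` holds `[0.0]`; the second environment is still mid-episode.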

Throughput Analysis:

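For a policy network with single-observation inference time $t_{net}$ and environment step time $t_{env}$, the time to advance all $N$ environments by one step is:

Sequential: Total time per step = $N \cdot (t_{env} + t_{net})$
Vectorized (parallel envs, batched inference): Total time per step = $t_{env} + t_{net}(N)$

where $t_{net}(N)$ is the time for a single batched forward pass with batch size $N$, so that $t_{net} = t_{net}(1)$. The vectorized figure assumes the $N$ environments step concurrently; if they are stepped one after another, the environment term becomes $N \cdot t_{env}$ and only the inference term benefits. Since $t_{net}(N) \ll N \cdot t_{net}(1)$ on GPUs, the speedup can be substantial.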
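
As a worked example of the two formulas, the sketch below plugs in illustrative timings. The numbers are assumptions chosen for easy arithmetic, not measurements, and the function names are hypothetical.

```python
def sequential_time(n_envs: int, t_env: float, t_net_single: float) -> float:
    """Time to advance n_envs environments one after another."""
    return n_envs * (t_env + t_net_single)


def vectorized_time(t_env: float, t_net_batched: float) -> float:
    """Time with concurrent environment steps and one batched forward pass."""
    return t_env + t_net_batched


# Assumed, illustrative timings in milliseconds.
N_ENVS = 16
T_ENV = 1.0    # one environment step
T_NET_1 = 2.0  # forward pass with batch size 1
T_NET_N = 3.0  # forward pass with batch size 16

seq_ms = sequential_time(N_ENVS, T_ENV, T_NET_1)
vec_ms = vectorized_time(T_ENV, T_NET_N)
speedup = seq_ms / vec_ms

print(seq_ms, vec_ms, speedup)  # 48.0 4.0 12.0
```

Under these assumed timings the sequential loop needs 48 ms per step of all environments and the vectorized version 4 ms, a 12x speedup; real ratios depend on how well the environments parallelize and how the forward pass scales with batch size.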

Batch Size and Variance:

With $N$ parallel environments each contributing $T$ steps, the effective batch size is $B = N \times T$. If the samples are treated as approximately independent, the variance of the policy gradient estimate scales as:

$\mathrm{Var}[\hat{\nabla}_{\theta} J] \propto \frac{1}{B}$

Larger batches from more parallel environments reduce gradient variance, enabling larger learning rates and more stable training.
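
The $1/B$ scaling can be checked numerically with a small Monte Carlo sketch. It averages independent unit-variance Gaussian samples as a stand-in for per-sample gradient estimates; this independence is an idealization, since consecutive steps from one environment are correlated in practice. The function name and the batch sizes are assumptions for this example.

```python
import random
import statistics


def variance_of_batch_mean(batch_size: int, trials: int, rng: random.Random) -> float:
    """Empirical variance of the mean of `batch_size` i.i.d. N(0, 1) samples."""
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pvariance(means)


rng = random.Random(0)  # fixed seed for reproducibility
var_small = variance_of_batch_mean(batch_size=16, trials=2000, rng=rng)
var_large = variance_of_batch_mean(batch_size=64, trials=2000, rng=rng)

# Quadrupling the batch size should cut the variance by roughly a factor of 4.
ratio = var_small / var_large
print(var_small, var_large, ratio)
```

The printed ratio should land close to 4, matching a fourfold increase in $B$. With correlated samples the reduction is smaller, which is one reason adding environments tends to help more than lengthening each rollout.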

Auto-Reset Semantics:

The auto-reset mechanism ensures that the observation tensor always contains valid initial or mid-episode observations. For correct advantage computation, the terminal observation (before reset) must be preserved separately to compute the bootstrapped value:

$V_{terminal} = \begin{cases} 0 & \text{if truly terminal (episode complete)} \\ V(s_{T+1}) & \text{if truncated (time limit)} \end{cases}$
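
The case distinction above might be applied when building a one-step value target, as in the following sketch. The separate `terminated` and `truncated` flags, the discount factor `gamma`, and both function names are assumptions introduced for illustration; the source only specifies the two cases.

```python
def bootstrap_value(terminated: bool, truncated: bool, value_of_final_obs: float) -> float:
    """Value to bootstrap from at the end of an episode.

    value_of_final_obs is the critic's estimate for the preserved pre-reset
    observation, not for the post-reset initial observation.
    """
    if terminated:
        return 0.0  # truly terminal: no future reward
    if truncated:
        return value_of_final_obs  # time limit: the episode could have continued
    raise ValueError("episode has not ended; bootstrap from the next observation")


def td_target(reward: float, gamma: float, terminated: bool,
              truncated: bool, value_of_final_obs: float) -> float:
    """One-step target: reward plus discounted bootstrap value."""
    return reward + gamma * bootstrap_value(terminated, truncated, value_of_final_obs)


terminal_target = td_target(1.0, 0.5, terminated=True, truncated=False, value_of_final_obs=10.0)
truncated_target = td_target(1.0, 0.5, terminated=False, truncated=True, value_of_final_obs=10.0)

print(terminal_target, truncated_target)  # 1.0 6.0
```

Treating a truncated episode as terminal would drop the `gamma * V` term and bias the value estimate downward for states near the time limit.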

Related Pages

Implementation:LaurentMazare_Tch_rs_VecGymEnv
