
Implementation:Farama Foundation Gymnasium ReacherEnv V4

From Leeroopedia
Knowledge Sources
Domains Reinforcement_Learning, MuJoCo_Environments
Last Updated 2026-02-15 03:00 GMT

Overview

Concrete implementation of the Reacher v4 MuJoCo environment provided by Gymnasium.

Description

The Reacher v4 environment models a two-jointed robot arm. The goal is to move the arm's end effector (the fingertip) close to a target that is spawned at a random position. The observation has 11 elements: the cosine and sine of the two joint angles, the target coordinates, the joint angular velocities, and the 3D vector from fingertip to target, whose z-component is always 0 because the Reacher moves in a plane. The reward is reward_dist + reward_ctrl, where reward_dist is the negative L2 distance from fingertip to target and reward_ctrl is the negative squared norm of the action. In v4 the reward is computed before the physics step, so it reflects the fingertip position before the action is applied. The environment never terminates; episodes end only through truncation (by default after 50 timesteps).
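The reward formula described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the Gymnasium source: the function name `reacher_v4_reward` and its arguments are hypothetical, and the positions are assumed to be given as 3-element sequences.

```python
import math
from typing import Sequence, Tuple


def reacher_v4_reward(
    fingertip_pos: Sequence[float],
    target_pos: Sequence[float],
    action: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (reward, reward_dist, reward_ctrl) following the description above.

    Hypothetical helper for illustration; not part of Gymnasium.
    """
    # Component-wise difference between fingertip and target positions.
    vec = [f - t for f, t in zip(fingertip_pos, target_pos)]
    # reward_dist: negative L2 distance between fingertip and target.
    reward_dist = -math.sqrt(sum(v * v for v in vec))
    # reward_ctrl: negative squared norm of the action (sum of squared torques).
    reward_ctrl = -sum(a * a for a in action)
    return reward_dist + reward_ctrl, reward_dist, reward_ctrl


# Example: fingertip 0.5 units from the target, moderate torques on both joints.
reward, reward_dist, reward_ctrl = reacher_v4_reward(
    fingertip_pos=[0.3, 0.4, 0.0],
    target_pos=[0.0, 0.0, 0.0],
    action=[0.5, -0.5],
)
```

Because v4 evaluates this before the physics step, the positions passed in would be those from before the action is applied.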

Usage

Use this environment to reproduce results from papers that used Reacher-v4. For new research, consider Reacher-v5, which computes the reward after the physics step, removes the constant-zero z-component from the observation, and provides configurable reward weights.
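When comparing code written against both versions, it can help to strip the constant-zero z-component from a v4 observation. The sketch below assumes the z-component is the last of the 11 elements (as listed under I/O Contract) and that the remaining 10 elements keep their order in v5; the helper name `drop_constant_z` is hypothetical.

```python
from typing import List, Sequence


def drop_constant_z(obs_v4: Sequence[float]) -> List[float]:
    """Drop the last element of an 11-element Reacher v4 observation.

    Assumption: the last element is the always-zero z-component of the
    fingertip-target vector. Hypothetical helper, not part of Gymnasium.
    """
    if len(obs_v4) != 11:
        raise ValueError(f"expected 11 elements, got {len(obs_v4)}")
    return list(obs_v4[:10])


# Example with made-up values; the final 0.0 is the z-component.
obs_v4 = [1.0, 1.0, 0.0, 0.0, 0.1, -0.1, 0.0, 0.0, 0.05, 0.02, 0.0]
obs_without_z = drop_constant_z(obs_v4)
```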

Code Reference

Source Location

Signature

class ReacherEnv(MujocoEnv, utils.EzPickle):
    def __init__(self, **kwargs)

Import

import gymnasium as gym
env = gym.make("Reacher-v4")

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| action | np.ndarray (2,) | Yes | Torques applied to the two hinge joints, range [-1, 1] |
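A policy may emit values outside the documented torque range. As a minimal sketch, assuming you want to bound such outputs yourself before calling `env.step`, a hypothetical helper `clip_action` could look like this (it accepts any 2-element sequence, including a NumPy array):

```python
from typing import List, Sequence


def clip_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Clip each of the two joint torques into [low, high].

    Hypothetical helper for illustration; not part of Gymnasium.
    """
    if len(action) != 2:
        raise ValueError(f"expected 2 torques, got {len(action)}")
    return [min(high, max(low, float(a))) for a in action]


# Example: the first torque exceeds the range and is clipped to 1.0.
safe_action = clip_action([1.5, -0.2])
```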

Outputs

| Name | Type | Description |
|---|---|---|
| observation | np.ndarray (11,) | State vector: cos(theta) (2), sin(theta) (2), target pos (2), angular velocities (2), fingertip-target vector (3) |
| reward | float | reward_dist + reward_ctrl (computed before physics step) |
| terminated | bool | Always False (Reacher never terminates) |
| truncated | bool | Episode truncation (handled by TimeLimit wrapper, default 50 timesteps) |
| info | dict | Contains reward_dist, reward_ctrl |
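The observation layout above can be unpacked into named parts with simple slicing. The following is a sketch under the assumption that the elements appear in exactly the order listed; the function name `split_reacher_v4_obs` and the dictionary keys are hypothetical. It also recovers the joint angles from their cosine and sine.

```python
import math
from typing import Dict, List, Sequence


def split_reacher_v4_obs(obs: Sequence[float]) -> Dict[str, List[float]]:
    """Split an 11-element Reacher v4 observation into named parts.

    Assumes the ordering given in the outputs table. Hypothetical helper.
    """
    if len(obs) != 11:
        raise ValueError(f"expected 11 elements, got {len(obs)}")
    parts = {
        "cos_theta": [float(x) for x in obs[0:2]],
        "sin_theta": [float(x) for x in obs[2:4]],
        "target_pos": [float(x) for x in obs[4:6]],
        "angular_vel": [float(x) for x in obs[6:8]],
        "fingertip_target_vec": [float(x) for x in obs[8:11]],
    }
    # Recover each joint angle (in radians) from its sine and cosine.
    parts["joint_angles"] = [
        math.atan2(s, c) for s, c in zip(parts["sin_theta"], parts["cos_theta"])
    ]
    return parts


# Example with made-up joint angles of 0.1 and -0.2 radians.
angles = [0.1, -0.2]
example_obs = (
    [math.cos(a) for a in angles]
    + [math.sin(a) for a in angles]
    + [0.1, -0.1]          # target position
    + [0.0, 0.0]           # joint angular velocities
    + [0.05, 0.02, 0.0]    # fingertip-target vector (z is always 0)
)
parts = split_reacher_v4_obs(example_obs)
```

The same slicing works on the `np.ndarray` returned by `env.step`, since NumPy arrays support `len` and slice indexing.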

Usage Examples

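import gymnasium as gym

# Create the environment and seed the first reset for reproducibility.
env = gym.make("Reacher-v4")
observation, info = env.reset(seed=42)

for _ in range(50):
    # Sample a random action (two torques in [-1, 1]).
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    # terminated is always False for Reacher; truncated becomes True at the time limit.
    if terminated or truncated:
        observation, info = env.reset()

# Release simulator and rendering resources.
env.close()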
