Implementation:Farama Foundation Gymnasium ReacherEnv V5

From Leeroopedia
Knowledge Sources
Domains: Reinforcement_Learning, MuJoCo_Environments
Last Updated: 2026-02-15 03:00 GMT

Overview

Concrete implementation of the Reacher v5 MuJoCo environment provided by Gymnasium.

Description

The Reacher v5 environment models a two-jointed robot arm. The goal is to move the arm's end effector (the fingertip) close to a target that is spawned at a random position.

The observation has 10 elements: the cosine and sine of the two joint angles, the target coordinates, the joint angular velocities, and the 2D vector between fingertip and target (fingertip position minus target position). v4 had 11 elements; v5 removes the z-component of that vector because it is always zero.

The reward is reward_distance + reward_control, where reward_distance is the weighted negative L2 distance between fingertip and target and reward_control is the weighted negative squared norm of the action. The info dictionary reports these two terms as reward_dist and reward_ctrl. v5 computes the reward from the state after the physics step (fixing a bug in v4) and exposes both reward weights as constructor arguments.
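The reward described above can be reproduced by hand, which is useful for checking logged values. The following is a minimal sketch and not Gymnasium's own code: `reacher_reward` is a hypothetical helper name, and the fingertip and target positions are passed in directly rather than read from the simulator.

```python
import math
from typing import Sequence


def reacher_reward(
    fingertip_xy: Sequence[float],
    target_xy: Sequence[float],
    action: Sequence[float],
    reward_dist_weight: float = 1.0,
    reward_control_weight: float = 1.0,
) -> tuple[float, dict[str, float]]:
    """Illustrative re-computation of the Reacher reward (hypothetical helper)."""
    # Distance term: weighted negative L2 distance between fingertip and target.
    dx = fingertip_xy[0] - target_xy[0]
    dy = fingertip_xy[1] - target_xy[1]
    reward_dist = -reward_dist_weight * math.hypot(dx, dy)

    # Control term: weighted negative squared norm of the action (joint torques).
    reward_ctrl = -reward_control_weight * sum(a * a for a in action)

    info = {"reward_dist": reward_dist, "reward_ctrl": reward_ctrl}
    return reward_dist + reward_ctrl, info


# Made-up positions and torques, chosen only to exercise the formula.
reward, info = reacher_reward(
    fingertip_xy=(0.3, 0.4), target_xy=(0.0, 0.0), action=(0.5, -0.5)
)
print(reward, info)
```

Both terms are non-positive, so the best achievable reward is zero: fingertip on the target with no torque applied.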

Usage

Use this environment to benchmark RL algorithms on a simple reaching task. Reacher is a good introductory MuJoCo environment for testing goal-conditioned and continuous-control algorithms. v5 is the recommended version for new research because it fixes the v4 reward bug and drops the constant-zero observation element.

Code Reference

Source Location

Signature

class ReacherEnv(MujocoEnv, utils.EzPickle):
    def __init__(
        self,
        xml_file: str = "reacher.xml",
        frame_skip: int = 2,
        default_camera_config: dict[str, float | int] = DEFAULT_CAMERA_CONFIG,
        reward_dist_weight: float = 1,
        reward_control_weight: float = 1,
        **kwargs,
    )
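The two reward weights in the signature above can be set when the environment is created, because `gym.make` forwards extra keyword arguments to the environment constructor. The sketch below wraps that call; `make_reacher` is a hypothetical helper name, and its defaults simply mirror the signature.

```python
def make_reacher(reward_dist_weight: float = 1.0, reward_control_weight: float = 1.0):
    """Create Reacher-v5 with explicit reward weights (hypothetical helper)."""
    # Imported inside the function so that defining the helper does not
    # require Gymnasium and MuJoCo to be installed.
    import gymnasium as gym

    # Extra keyword arguments are passed through to ReacherEnv.__init__.
    return gym.make(
        "Reacher-v5",
        reward_dist_weight=reward_dist_weight,
        reward_control_weight=reward_control_weight,
    )
```

For example, `make_reacher(reward_control_weight=0.1)` would penalize large torques less than the default; the value 0.1 is only an illustration, not a recommended setting.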

Import

import gymnasium as gym
env = gym.make("Reacher-v5")

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| action | np.ndarray (2,) | Yes | Torques applied to the two hinge joints, each in the range [-1, 1] |

Outputs

| Name | Type | Description |
|---|---|---|
| observation | np.ndarray (10,) | State vector: cos(theta) (2), sin(theta) (2), target position (2), joint angular velocities (2), fingertip position minus target position, xy (2) |
| reward | float | reward_dist + reward_ctrl (computed after the physics step, with configurable weights) |
| terminated | bool | Always False (Reacher never terminates) |
| truncated | bool | Episode truncation (handled by the TimeLimit wrapper, default 50 timesteps) |
| info | dict | Contains reward_dist and reward_ctrl |
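To make the observation layout in the outputs table concrete, the sketch below splits a 10-element observation into named parts. `split_observation` and the dictionary keys are hypothetical names chosen for illustration; only the slice positions come from the table.

```python
from typing import Sequence


def split_observation(observation: Sequence[float]) -> dict[str, list[float]]:
    """Name the slices of a 10-element Reacher-v5 observation (hypothetical helper)."""
    if len(observation) != 10:
        raise ValueError(f"expected 10 elements, got {len(observation)}")
    # Works for plain lists as well as NumPy arrays.
    obs = [float(x) for x in observation]
    return {
        "cos_theta": obs[0:2],                   # cosine of each joint angle
        "sin_theta": obs[2:4],                   # sine of each joint angle
        "target_xy": obs[4:6],                   # target coordinates
        "joint_velocity": obs[6:8],              # angular velocity of each joint
        "fingertip_minus_target_xy": obs[8:10],  # fingertip position minus target position
    }


# Made-up observation values, used only to show the slicing.
parts = split_observation([1.0, 0.0, 0.0, 1.0, 0.1, -0.1, 0.0, 0.0, 0.11, 0.2])
print(parts["target_xy"], parts["fingertip_minus_target_xy"])
```

The same function can be applied to the array returned by `env.reset()` or `env.step()`.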

Usage Examples

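import gymnasium as gym

env = gym.make("Reacher-v5")
observation, info = env.reset(seed=42)  # seed once for a reproducible first episode

for _ in range(50):
    action = env.action_space.sample()  # random torques in [-1, 1] for both joints
    observation, reward, terminated, truncated, info = env.step(action)
    # terminated is always False here; truncated becomes True at the 50-step time limit
    if terminated or truncated:
        observation, info = env.reset()

env.close()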
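When benchmarking, the quantity of interest is usually the total reward per episode rather than individual steps. The sketch below accumulates it for any environment that follows the Gymnasium `reset`/`step` interface; `episode_return` and `policy` are hypothetical names. It assumes the environment eventually reports `terminated` or `truncated`, which holds for `gym.make("Reacher-v5")` because of the 50-step time limit.

```python
from typing import Any, Callable, Optional


def episode_return(
    env: Any, policy: Callable[[Any], Any], seed: Optional[int] = None
) -> float:
    """Run one episode with `policy` and return the undiscounted sum of rewards."""
    observation, info = env.reset(seed=seed)
    total = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total += float(reward)
        # Reacher never sets terminated, so the loop ends on truncation.
        done = terminated or truncated
    return total
```

With an environment created as in the example above, a random baseline could be measured with `episode_return(env, lambda obs: env.action_space.sample(), seed=0)`.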
