Implementation:LaurentMazare Tch rs DDPG Agent

From Leeroopedia


Overview

ddpg.rs is a Rust implementation of the Deep Deterministic Policy Gradient (DDPG) algorithm located at examples/reinforcement-learning/ddpg.rs (343 lines) in the tch-rs repository. DDPG is an actor-critic reinforcement learning algorithm designed for continuous action spaces, as described in "Continuous control with deep reinforcement learning" (Lillicrap et al., 2015). This implementation demonstrates how to build a complete deep RL training loop using the tch-rs bindings to PyTorch, including neural network construction, experience replay, Ornstein-Uhlenbeck exploration noise, target network soft updates, and episodic training on the Pendulum-v1 environment via a Python Gym bridge.

Code Reference

Constants

Constant                Value    Purpose
GAMMA                   0.99     Discount factor for future rewards in the Bellman equation
TAU                     0.005    Soft update weight for target network tracking
REPLAY_BUFFER_CAPACITY  100,000  Maximum number of transitions stored in the replay buffer
TRAINING_BATCH_SIZE     100      Number of transitions sampled per training iteration
MAX_EPISODES            100      Total number of training episodes
EPISODE_LENGTH          200      Maximum number of steps per episode
TRAINING_ITERATIONS     200      Number of gradient updates after each episode
MU                      0.0      Mean of the Ornstein-Uhlenbeck process
THETA                   0.15     Mean-reversion rate of the OU process
SIGMA                   0.1      Volatility (noise magnitude) of the OU process
ACTOR_LEARNING_RATE     1e-4     Learning rate for the actor network (Adam optimizer)
CRITIC_LEARNING_RATE    1e-3     Learning rate for the critic network (Adam optimizer)

OuNoise

struct OuNoise {
    mu: f64,
    theta: f64,
    sigma: f64,
    state: Tensor,
}

Implements the Ornstein-Uhlenbeck process for temporally correlated exploration noise in continuous action spaces.

Key methods:

  • fn new(mu: f64, theta: f64, sigma: f64, num_actions: usize) -> Self -- Initializes the noise process with state set to ones of size num_actions.
  • fn sample(&mut self) -> &Tensor -- Computes one OU step, dx = theta * (mu - state) + sigma * N(0,1), adds dx to the internal state, and returns a reference to the updated state, which serves as the noise tensor.
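
The OU step described above can be sketched in plain Rust without tch, to make the arithmetic concrete. This is an illustration under stated assumptions, not the code from ddpg.rs: the Rng type, its xorshift generator, the Box-Muller sampler, and the ou_step function are hypothetical helpers, and the state is a plain f64 slice instead of a Tensor.

```rust
/// Hypothetical dependency-free PRNG (xorshift64*); the seed must be non-zero.
struct Rng(u64);

impl Rng {
    /// Uniform sample strictly inside (0, 1), built from 52 random bits.
    fn uniform(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 12;
        (bits as f64 + 0.5) / ((1u64 << 52) as f64)
    }

    /// Standard normal sample via the Box-Muller transform.
    fn normal(&mut self) -> f64 {
        let (u1, u2) = (self.uniform(), self.uniform());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// One OU step per component: state += theta * (mu - state) + sigma * N(0, 1).
fn ou_step(state: &mut [f64], mu: f64, theta: f64, sigma: f64, rng: &mut Rng) {
    for x in state.iter_mut() {
        *x += theta * (mu - *x) + sigma * rng.normal();
    }
}
```

With sigma set to 0 the process is purely mean-reverting: starting from the initial state of 1.0 with MU = 0.0 and THETA = 0.15, one step yields 0.85. The noise term perturbs this decay, which is what makes consecutive samples temporally correlated.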

ReplayBuffer

struct ReplayBuffer {
    obs: Tensor,
    next_obs: Tensor,
    rewards: Tensor,
    actions: Tensor,
    capacity: usize,
    len: usize,
    i: usize,
}

A fixed-capacity circular replay buffer that stores transitions as pre-allocated tensors on CPU.

Key methods:

  • fn new(capacity: usize, num_obs: usize, num_actions: usize) -> Self -- Allocates zero-filled tensors of shape [capacity, num_obs] for obs and next_obs, [capacity, num_actions] for actions, and [capacity, 1] for rewards.
  • fn push(&mut self, obs: &Tensor, actions: &Tensor, reward: &Tensor, next_obs: &Tensor) -- Stores a transition at the current circular index. Uses copy_ for in-place tensor updates.
  • fn random_batch(&self, batch_size: usize) -> Option<(Tensor, Tensor, Tensor, Tensor)> -- Returns a random batch of (states, actions, rewards, next_states) using Tensor::randint for index sampling. Returns None if fewer than 3 transitions are stored.
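
The circular-index bookkeeping can be illustrated with a small generic ring buffer. This is a sketch only: RingBuffer and its methods are hypothetical names, it stores arbitrary items instead of pre-allocated tensors, and it assumes the write slot is the push counter modulo the capacity. Random batch sampling is left out because it needs a random number source.

```rust
/// Hypothetical fixed-capacity ring buffer with pre-allocated slots.
struct RingBuffer<T> {
    slots: Vec<T>,
    len: usize, // number of valid entries, saturates at capacity
    i: usize,   // total number of pushes so far
}

impl<T: Clone + Default> RingBuffer<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { slots: vec![T::default(); capacity], len: 0, i: 0 }
    }

    /// Overwrites the oldest entry once the buffer is full.
    fn push(&mut self, item: T) {
        let slot = self.i % self.slots.len();
        self.slots[slot] = item;
        self.i += 1;
        self.len = (self.len + 1).min(self.slots.len());
    }

    fn len(&self) -> usize {
        self.len
    }

    /// Returns the entry in a given slot, or None if that slot was never written.
    fn get(&self, index: usize) -> Option<&T> {
        if index < self.len { self.slots.get(index) } else { None }
    }
}
```

Pushing four items into a buffer of capacity three leaves len at 3 and places the fourth item in slot 0, replacing the first.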

Actor

struct Actor {
    var_store: nn::VarStore,
    network: nn::Sequential,
    device: Device,
    num_obs: usize,
    num_actions: usize,
    opt: nn::Optimizer,
    learning_rate: f64,
}

The policy network that maps observations to continuous actions.

Architecture:

  • Linear layer: num_obs -> 400, followed by ReLU
  • Linear layer: 400 -> 300, followed by ReLU
  • Linear layer: 300 -> num_actions, followed by tanh (output bounded to [-1, 1])

Key methods:

  • fn new(num_obs: usize, num_actions: usize, learning_rate: f64) -> Self -- Builds the network with an Adam optimizer.
  • fn forward(&self, obs: &Tensor) -> Tensor -- Runs a forward pass, moving observations to the correct device.
  • impl Clone for Actor -- Deep clones the actor by creating a new instance and copying the VarStore.
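
To get a feel for the size of this architecture, the number of trainable parameters can be counted from the layer widths. The helper below is a hypothetical illustration, and it assumes every linear layer carries a bias vector, which the description above does not state explicitly.

```rust
/// Counts weights plus biases for a stack of fully connected layers.
/// `layer_sizes` lists the widths from input to output, e.g. [3, 400, 300, 1].
/// Assumption: each layer has a bias term.
fn mlp_param_count(layer_sizes: &[usize]) -> usize {
    layer_sizes
        .windows(2)
        .map(|w| w[0] * w[1] + w[1]) // in * out weights, plus out biases
        .sum()
}
```

For Pendulum-v1 (3 observations, 1 action) the actor would have mlp_param_count(&[3, 400, 300, 1]) = 122201 parameters under that assumption; nearly all of them sit in the 400 -> 300 layer.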

Critic

struct Critic {
    var_store: nn::VarStore,
    network: nn::Sequential,
    device: Device,
    num_obs: usize,
    num_actions: usize,
    opt: nn::Optimizer,
    learning_rate: f64,
}

The Q-value network that estimates the value of a state-action pair.

Architecture:

  • Linear layer: (num_obs + num_actions) -> 400, followed by ReLU
  • Linear layer: 400 -> 300, followed by ReLU
  • Linear layer: 300 -> 1 (scalar Q-value, no activation)

Key methods:

  • fn new(num_obs: usize, num_actions: usize, learning_rate: f64) -> Self -- Builds the network. The input dimension is the concatenation of observation and action dimensions.
  • fn forward(&self, obs: &Tensor, actions: &Tensor) -> Tensor -- Concatenates actions and observations along dimension 1, then applies the network.
  • impl Clone for Critic -- Deep clones via VarStore::copy.
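
The input layout seen by the critic's first layer can be sketched for a single transition. The function name is hypothetical, and the order (actions first, then observations) is taken from the description of forward above; the real code concatenates batched tensors along dimension 1.

```rust
/// Builds one critic input row of length num_actions + num_obs.
/// Assumption: actions come first, followed by the observation.
fn critic_input(obs: &[f64], actions: &[f64]) -> Vec<f64> {
    let mut row = Vec::with_capacity(actions.len() + obs.len());
    row.extend_from_slice(actions);
    row.extend_from_slice(obs);
    row
}
```

For Pendulum-v1 this yields a 4-element row, which matches the (num_obs + num_actions) -> 400 input layer.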

Helper Function: track

fn track(dest: &mut nn::VarStore, src: &nn::VarStore, tau: f64)

Performs Polyak (soft) averaging to update target network parameters: dest = tau * src + (1 - tau) * dest. Operates inside tch::no_grad to avoid tracking gradients.
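
The same update rule can be shown on plain f64 slices. This is a minimal sketch with a hypothetical soft_update function; the real track operates on the tensors held by the two VarStores.

```rust
/// Polyak averaging: dest = tau * src + (1 - tau) * dest, element by element.
fn soft_update(dest: &mut [f64], src: &[f64], tau: f64) {
    assert_eq!(dest.len(), src.len(), "parameter vectors must have equal length");
    for (d, s) in dest.iter_mut().zip(src) {
        *d = tau * s + (1.0 - tau) * *d;
    }
}
```

With TAU = 0.005 the target moves only 0.5% of the way toward the online parameters per call, so the bootstrap targets used by the critic change slowly.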

Agent

struct Agent {
    actor: Actor,
    actor_target: Actor,
    critic: Critic,
    critic_target: Critic,
    replay_buffer: ReplayBuffer,
    ou_noise: OuNoise,
    train: bool,
    gamma: f64,
    tau: f64,
}

The top-level DDPG agent that orchestrates all components.

Key methods:

  • fn new(actor, critic, ou_noise, replay_buffer_capacity, train, gamma, tau) -> Self -- Clones actor and critic to create target networks; allocates the replay buffer.
  • fn actions(&mut self, obs: &Tensor) -> Tensor -- Computes actions using the actor (inside no_grad) and adds OU noise if in training mode.
  • fn remember(&mut self, obs, actions, reward, next_obs) -- Stores a transition in the replay buffer.
  • fn train(&mut self, batch_size: usize) -- Performs one DDPG training step:
    1. Samples a random batch from the replay buffer (returns early if insufficient data).
    2. Computes target Q-values: q_target = reward + gamma * critic_target(next_state, actor_target(next_state)), detached from the computation graph.
    3. Updates the critic by minimizing MSE between predicted and target Q-values.
    4. Updates the actor by maximizing the critic's Q-value estimate for the actor's chosen actions (negated for gradient descent).
    5. Soft-updates both target networks via track.
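
The scalar quantities behind steps 2 to 4 can be sketched on plain slices. The function names are hypothetical, and the network outputs are passed in as precomputed numbers; no gradients are computed here, since in the real code that is handled by the tch optimizers.

```rust
/// Step 2: q_target = reward + gamma * Q_target(next_state, actor_target(next_state)).
/// `next_q` holds the target critic's output for each sampled transition.
fn td_targets(rewards: &[f64], next_q: &[f64], gamma: f64) -> Vec<f64> {
    assert_eq!(rewards.len(), next_q.len());
    rewards.iter().zip(next_q).map(|(r, q)| r + gamma * q).collect()
}

/// Step 3: mean squared error between predicted and target Q-values.
fn critic_loss(predicted: &[f64], target: &[f64]) -> f64 {
    assert_eq!(predicted.len(), target.len());
    let sum: f64 = predicted.iter().zip(target).map(|(p, t)| (p - t).powi(2)).sum();
    sum / predicted.len() as f64
}

/// Step 4: the actor minimizes the negated mean Q-value of its own actions.
fn actor_loss(q_values: &[f64]) -> f64 {
    -q_values.iter().sum::<f64>() / q_values.len() as f64
}
```

For example, a reward of -1.0 with a target-critic estimate of 2.0 and gamma = 0.99 gives a target of 0.98.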

Entry Point

pub fn run() -> cpython::PyResult<()>

Creates a Pendulum-v1 environment via the GymEnv Python bridge, constructs the agent with the default hyperparameters, and runs the training loop:

  1. For each of MAX_EPISODES episodes, collects up to EPISODE_LENGTH steps.
  2. Actions are scaled by 2.0 and clamped to [-2.0, 2.0] (matching Pendulum-v1's action range).
  3. After each episode, performs TRAINING_ITERATIONS gradient updates.
  4. Prints the total reward per episode.
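
The action mapping in step 2 is a one-liner; the sketch below uses a hypothetical function name. The clamp matters because the actor's tanh output lies in [-1, 1], but the OU noise added in training mode can push the value outside that range before scaling.

```rust
/// Maps a raw agent action to Pendulum-v1's torque range [-2.0, 2.0].
fn to_env_action(raw: f64) -> f64 {
    (2.0 * raw).clamp(-2.0, 2.0)
}
```

A raw action of 0.5 becomes 1.0, while a noisy raw action of 1.3 is clipped to 2.0.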

I/O Contract

Inputs

  • Environment: Pendulum-v1 from OpenAI Gym, accessed through GymEnv (a cpython bridge defined in the same example directory).
  • Observation space: Continuous vector (3-dimensional for Pendulum-v1: cos(theta), sin(theta), angular velocity).
  • Action space: Continuous scalar (1-dimensional for Pendulum-v1, range [-2.0, 2.0]).

Outputs

  • Actions: Continuous-valued tensors produced by the actor network, bounded by tanh to [-1, 1], then scaled and clamped to [-2.0, 2.0].
  • Console output: Per-episode total reward, action space dimensions, and observation space shape.
  • Trained model: The actor and critic networks are trained in-place (no model saving is performed in this example).

Invariants

  • The replay buffer requires at least 3 stored transitions before training can begin.
  • Target networks are always initialized as exact copies of the online networks.
  • OU noise is only applied during training mode (self.train == true).
  • All tensor operations run on CPU (tch::Device::Cpu).

Algorithm Summary

The DDPG update follows this sequence each training step:

  1. Sample a mini-batch of (s, a, r, s') from the replay buffer.
  2. Critic update: Compute y = r + gamma * Q_target(s', mu_target(s')). Minimize MSE(Q(s, a), y).
  3. Actor update: Maximize Q(s, mu(s)) with respect to the actor parameters (equivalently, minimize -Q(s, mu(s))).
  4. Soft update: Move target networks toward online networks using tau.
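
In compact notation, the four steps correspond to the standard DDPG equations below. The symbols are assumptions introduced here for illustration: \theta and \phi denote the actor and critic parameters, primes denote the target networks, and N is the mini-batch size.

```latex
\begin{aligned}
y_i &= r_i + \gamma \, Q_{\phi'}\bigl(s'_i, \mu_{\theta'}(s'_i)\bigr) \\
L(\phi) &= \frac{1}{N} \sum_{i=1}^{N} \bigl(Q_{\phi}(s_i, a_i) - y_i\bigr)^2 \\
\nabla_{\theta} J &\approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_{a} Q_{\phi}(s_i, a)\Big|_{a = \mu_{\theta}(s_i)} \, \nabla_{\theta}\, \mu_{\theta}(s_i) \\
\theta' &\leftarrow \tau\,\theta + (1 - \tau)\,\theta', \qquad
\phi' \leftarrow \tau\,\phi + (1 - \tau)\,\phi'
\end{aligned}
```

The third line is the deterministic policy gradient: the critic's gradient with respect to the action is chained through the actor, which is what minimizing -Q(s, mu(s)) by backpropagation computes.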

Dependencies

  • tch -- Rust bindings to libtorch for tensor operations, neural network modules, and optimizers
  • cpython -- Python interoperability for accessing OpenAI Gym environments
  • super::gym_env::GymEnv -- Rust wrapper around Python Gym environments defined in the same example module
