Principle:LaurentMazare Tch rs Deep Deterministic Policy Gradient

Knowledge Sources	LaurentMazare_Tch_rs; Continuous control with deep reinforcement learning (Lillicrap et al., 2016); Deterministic Policy Gradient Algorithms (Silver et al., 2014)
Domains	Reinforcement Learning, Continuous Control, Actor-Critic Methods
Last Updated	2026-02-08 00:00 GMT

Overview

Deep Deterministic Policy Gradient (DDPG) is an off-policy, model-free reinforcement learning algorithm that extends Q-learning to continuous action spaces by learning a deterministic policy alongside an action-value function.

Description

Value-based deep reinforcement learning algorithms such as DQN work well for discrete action spaces (e.g., choosing from a finite set of buttons) because the greedy action is found by comparing a finite number of Q-values. Many real-world control problems, however, have continuous action spaces (e.g., applying a torque value anywhere in a range), where that maximization is itself a costly optimization problem. DDPG addresses this gap by combining ideas from several areas:

Actor-Critic architecture - Two neural networks work in tandem: the actor (policy network) $μ (s | θ^{μ})$ maps states directly to continuous actions, and the critic (value network) $Q (s, a | θ^{Q})$ estimates the expected return of taking action $a$ in state $s$.

Deterministic policy - Unlike stochastic policy gradient methods that output action probability distributions, DDPG's actor outputs a single deterministic action for each state. The policy gradient is then an expectation over states only, rather than over states and actions, so it can be estimated from fewer samples in continuous action spaces.

Off-policy learning with experience replay - Transitions $(s, a, r, s^{'})$ are stored in a replay buffer and sampled randomly for training. This breaks temporal correlations, improves sample efficiency, and allows learning from past experience.

Target networks - Separate target copies of both actor and critic networks are maintained for computing TD targets. These target networks are updated slowly via soft updates (Polyak averaging), which stabilizes training by preventing the learning targets from changing too rapidly.

Ornstein-Uhlenbeck exploration noise - Since the deterministic policy produces no exploration on its own, noise is added to actions during training. The Ornstein-Uhlenbeck (OU) process generates temporally correlated noise, which is physically plausible for inertial systems and provides smoother exploration than independent Gaussian noise.

Usage

DDPG is appropriate for:

Continuous control tasks - Robotic manipulation, locomotion, autonomous driving, where actions are real-valued vectors.
Low- to moderate-dimensional action spaces - DDPG tends to work best when the action space has at most a few tens of dimensions (roughly 1-20 as a rule of thumb).
Environments with smooth dynamics - The deterministic policy and OU noise work best when the optimal policy is smooth with respect to the state.

DDPG may not be suitable for:

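Discrete action spaces - DQN or stochastic policy gradient methods are more natural, since the deterministic policy gradient requires differentiating $Q$ with respect to the action.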
Very high-dimensional action spaces - Scalability becomes an issue.
Highly stochastic environments - A stochastic policy may be fundamentally necessary.

Modern successors like TD3 and SAC address several of DDPG's stability issues while retaining its core ideas.

Theoretical Basis

Deterministic Policy Gradient Theorem

The key theoretical result (Silver et al., 2014) shows that the gradient of the expected return $J$ with respect to the policy parameters $θ^{μ}$ is:

$\nabla_{θ^{μ}} J = 𝔼_{s \sim ρ^{μ}} [\nabla_{θ^{μ}} μ (s | θ^{μ}) \cdot \nabla_{a} Q (s, a | θ^{Q}) |_{a = μ (s)}]$

where $ρ^{μ}$ is the state distribution under policy $μ$. This avoids integrating over the action space, requiring only the gradient of $Q$ with respect to the action, evaluated at the action chosen by the current policy.

Critic Update

The critic is trained to minimize the Bellman error using transitions $(s_{i}, a_{i}, r_{i}, s'_{i})$ sampled from the replay buffer:

$y_{i} = r_{i} + γ Q^{'} (s'_{i}, μ^{'} (s'_{i} | θ^{μ^{'}}) | θ^{Q^{'}})$

$L = \frac{1}{N} \sum_{i} (y_{i} - Q (s_{i}, a_{i} | θ^{Q}))^{2}$

where $Q^{'}$ and $μ^{'}$ are the target networks and $γ$ is the discount factor.
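
As a minimal sketch of the two formulas above, the following pure-Python snippet computes the TD targets $y_{i}$ and the mean squared Bellman error for a small minibatch. The callables `q`, `q_target`, and `mu_target` are hypothetical toy functions standing in for the critic and the target networks, and states and actions are assumed to be scalars for brevity.

```python
def td_targets(batch, q_target, mu_target, gamma):
    """Compute y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)) for each transition."""
    return [r + gamma * q_target(s_next, mu_target(s_next))
            for (_s, _a, r, s_next) in batch]


def critic_loss(batch, targets, q):
    """Mean squared Bellman error L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    errors = [(y - q(s, a)) ** 2
              for (s, a, _r, _s_next), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Hypothetical toy functions standing in for neural networks.
q = lambda s, a: s + a               # online critic Q
q_target = lambda s, a: 2.0 * s + a  # target critic Q'
mu_target = lambda s: -s             # target actor mu'

batch = [(0.0, 1.0, 1.0, 1.0), (1.0, 0.0, 0.5, 2.0)]  # (s, a, r, s')
targets = td_targets(batch, q_target, mu_target, gamma=0.5)
loss = critic_loss(batch, targets, q)
```

Note that the targets depend only on the target networks, so they act as fixed regression labels while the online critic is being fitted.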

Actor Update

The actor is updated by ascending the deterministic policy gradient, approximated using a minibatch:

$\nabla_{θ^{μ}} J \approx \frac{1}{N} \sum_{i} \nabla_{θ^{μ}} μ (s_{i} | θ^{μ}) \cdot \nabla_{a} Q (s_{i}, a | θ^{Q}) |_{a = μ (s_{i})}$

In practice, this is computed by:

Forward pass the state through the actor to get action $a = μ (s)$.
Forward pass $(s, a)$ through the critic to get $Q (s, a)$.
Backpropagate the loss $- \frac{1}{N} \sum_{i} Q (s_{i}, μ (s_{i}))$ through the critic into the actor, but apply the optimizer step only to the actor's parameters.
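
The chain rule behind these steps can be checked numerically. The sketch below assumes a scalar linear actor `a = w * s` and a hand-written critic, both hypothetical and chosen only so that the gradients have closed forms. It compares the minibatch deterministic policy gradient with a finite-difference estimate and then runs gradient ascent on the actor parameter alone.

```python
def actor(w, s):
    """Hypothetical linear deterministic policy: a = mu(s | w) = w * s."""
    return w * s


def critic(s, a):
    """Hypothetical fixed critic, maximized at a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Analytic gradient of the critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def policy_gradient(w, states):
    """(1/N) * sum_i d mu/d w (s_i) * dQ/da (s_i, a) evaluated at a = mu(s_i)."""
    # For the linear actor, d mu / d w = s.
    return sum(s * dq_da(s, actor(w, s)) for s in states) / len(states)


def objective(w, states):
    """Minibatch estimate of J(w) = mean_i Q(s_i, mu(s_i | w))."""
    return sum(critic(s, actor(w, s)) for s in states) / len(states)


states = [-1.0, 0.5, 2.0]
w = 0.0
grad = policy_gradient(w, states)

# Central finite difference of J as an independent check of the chain rule.
eps = 1e-6
numeric = (objective(w + eps, states) - objective(w - eps, states)) / (2 * eps)

# Gradient ascent on the actor parameter only; the critic is left untouched.
for _ in range(200):
    w += 0.1 * policy_gradient(w, states)
```

In this toy setting `w` approaches 2, the value at which the actor outputs the critic's preferred action for every state. With neural networks, an automatic differentiation library would compute the same product of gradients during backpropagation.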

Soft Target Updates

Target networks are updated using Polyak averaging with coefficient $τ ≪ 1$ (typically $τ = 0.001$):

$θ^{Q^{'}} \leftarrow τ θ^{Q} + (1 - τ) θ^{Q^{'}}$

$θ^{μ^{'}} \leftarrow τ θ^{μ} + (1 - τ) θ^{μ^{'}}$

This creates a slowly moving target that stabilizes TD learning. It plays the same role as the periodically copied target network in DQN, but the target changes by a small amount at every step instead of jumping at each copy.
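
A minimal sketch of the soft update, assuming parameters are stored as flat lists of floats (a simplification; a real implementation would iterate over tensors). The function name `soft_update` and the parameter values are hypothetical, and $τ = 0.5$ is used only to make the arithmetic easy to follow.

```python
def soft_update(target, online, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online = [1.0, -2.0]  # hypothetical online-network parameters
target = [0.0, 0.0]   # hypothetical target-network parameters

# A large tau is chosen here purely for readability.
target = soft_update(target, online, tau=0.5)
first_step = list(target)

# With the online weights held fixed, the remaining gap shrinks by a
# factor of (1 - tau) on every update.
for _ in range(9):
    target = soft_update(target, online, tau=0.5)
```

Because the gap to the online weights decays geometrically, with $τ = 0.001$ it takes roughly 700 updates for the target to close half of the distance to a fixed set of online weights.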

Ornstein-Uhlenbeck Noise Process

The OU process is defined by the stochastic differential equation:

$d x_{t} = θ_{O U} (μ_{O U} - x_{t}) d t + σ_{O U} d W_{t}$

where:

$θ_{O U}$ is the mean reversion rate (how fast noise returns to the mean)
$μ_{O U}$ is the long-term mean (typically 0)
$σ_{O U}$ is the volatility (noise magnitude)
$W_{t}$ is a Wiener process (Brownian motion)

In discrete time with step size $Δ t = 1$:

$x_{t + 1} = x_{t} + θ_{O U} (μ_{O U} - x_{t}) + σ_{O U} ε_{t}, \quad ε_{t} \sim \mathcal{N} (0, 1)$

The exploration action is then:

$a_{t} = μ (s_{t} | θ^{μ}) + x_{t}$
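
The discrete-time update can be sketched as a small Python class for a single action dimension. The class name `OUNoise` and its interface are hypothetical. The defaults `theta=0.15` and `sigma=0.2` follow the values commonly quoted from Lillicrap et al. and should be treated as assumptions to tune rather than required settings.

```python
import random


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process with step size 1."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, seed=None):
        self.theta, self.mu, self.sigma = theta, mu, sigma
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        """Restart the process at its long-term mean (e.g. at episode start)."""
        self.x = self.mu

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        drift = self.theta * (self.mu - self.x)
        self.x += drift + self.sigma * self.rng.gauss(0.0, 1.0)
        return self.x


noise = OUNoise(seed=0)
samples = [noise.sample() for _ in range(5)]

# With sigma = 0 the process is pure mean reversion:
# the distance to the mean shrinks by (1 - theta) per step.
decay = OUNoise(theta=0.5, sigma=0.0)
decay.x = 1.0
decayed = [decay.sample() for _ in range(3)]
```

During training, the exploration action would be `policy(state) + noise.sample()`, where `policy` is a hypothetical actor function. Successive samples are correlated because each one starts from the previous value of `x`.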

Experience Replay Buffer

The replay buffer stores the most recent $M$ transitions as a circular buffer:

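BUFFER of capacity M, position = 0, current_size = 0

FUNCTION store(s, a, r, s', done):
    buffer[position] = (s, a, r, s', done)   // overwrites the oldest entry once full
    position = (position + 1) mod M
    current_size = min(current_size + 1, M)

FUNCTION sample(batch_size):
    // uniform indices in [0, current_size), upper bound exclusive
    indices = random_integers(0, current_size, batch_size)
    RETURN buffer[indices]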
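
The circular buffer can be sketched in Python as follows. The class name `ReplayBuffer` is hypothetical, and sampling with replacement is an assumption made for simplicity. Transitions are kept as plain tuples, whereas a tensor-based implementation would typically preallocate one array per field.

```python
import random


class ReplayBuffer:
    """Fixed-capacity circular buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.data = [None] * capacity
        self.position = 0  # next slot to write
        self.size = 0      # number of valid entries, at most capacity
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        """Write a transition, overwriting the oldest one once the buffer is full."""
        self.data[self.position] = (s, a, r, s_next, done)
        self.position = (self.position + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        """Draw batch_size transitions uniformly at random, with replacement."""
        indices = [self.rng.randrange(self.size) for _ in range(batch_size)]
        return [self.data[i] for i in indices]


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(s=t, a=0.0, r=1.0, s_next=t + 1, done=False)

stored_states = sorted(tr[0] for tr in buffer.data)
batch = buffer.sample(batch_size=4)
```

After five writes into a buffer of capacity three, only the transitions from steps 2, 3, and 4 remain, which is the "most recent $M$ transitions" behavior described above.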

Complete DDPG Training Loop

INITIALIZE actor mu, critic Q with random weights
INITIALIZE target networks mu', Q' as copies
INITIALIZE replay buffer B
INITIALIZE OU noise process

FOR each episode:
    RESET environment, get initial state s
    RESET noise process

    FOR each step:
        a = mu(s) + noise.sample()
        s', r, done = environment.step(a)
        B.store(s, a, r, s', done)

        IF B.size >= batch_size:
            Sample minibatch from B
            Compute target y = r + gamma * (1 - done) * Q'(s', mu'(s'))
            Update critic by minimizing mean (y - Q(s, a))^2
            Update actor by maximizing mean Q(s, mu(s))
            Soft-update target networks

        s = s'
        IF done: BREAK

Related Pages

Implementation:LaurentMazare_Tch_rs_DDPG_Agent
