Principle:LaurentMazare Tch rs Deep Deterministic Policy Gradient
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, Continuous Control, Actor-Critic Methods |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Deep Deterministic Policy Gradient (DDPG) is an off-policy, model-free reinforcement learning algorithm that extends Q-learning to continuous action spaces by learning a deterministic policy alongside an action-value function.
Description
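Value-based deep reinforcement learning algorithms such as DQN work well for discrete action spaces (e.g., choosing from a finite set of buttons), where the maximum over actions in the Q-learning target is cheap to compute. Many real-world control problems, however, have continuous action spaces (e.g., applying a torque value anywhere in a range), where that maximization becomes an optimization problem at every step. DDPG addresses this gap by combining ideas from several areas: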
- Actor-Critic architecture - Two neural networks work in tandem: the actor (policy network) $\mu(s)$ maps states directly to continuous actions, and the critic (value network) $Q(s, a)$ estimates the expected return of taking action $a$ in state $s$.
- Deterministic policy - Unlike stochastic policy gradient methods that output action probability distributions, DDPG's actor outputs a single deterministic action for each state. This is more sample-efficient for continuous spaces because it avoids integrating over the action space.
- Off-policy learning with experience replay - Transitions are stored in a replay buffer and sampled randomly for training. This breaks temporal correlations, improves sample efficiency, and allows learning from past experience.
- Target networks - Separate target copies of both actor and critic networks are maintained for computing TD targets. These target networks are updated slowly via soft updates (Polyak averaging), which stabilizes training by preventing the learning targets from changing too rapidly.
- Ornstein-Uhlenbeck exploration noise - Since the deterministic policy produces no exploration on its own, noise is added to actions during training. The Ornstein-Uhlenbeck (OU) process generates temporally correlated noise, which is physically plausible for inertial systems and provides smoother exploration than independent Gaussian noise.
Usage
DDPG is appropriate for:
- Continuous control tasks - Robotic manipulation, locomotion, autonomous driving, where actions are real-valued vectors.
- Low-dimensional action spaces - DDPG works best when the action space has moderate dimensionality (roughly 1-20 dimensions).
- Environments with smooth dynamics - The deterministic policy and OU noise work best when the optimal policy is smooth with respect to the state.
DDPG may not be suitable for:
- Discrete action spaces - DQN or policy gradient methods are more natural.
- Very high-dimensional action spaces - Scalability becomes an issue.
- Highly stochastic environments - A stochastic policy may be fundamentally necessary.
Modern successors like TD3 and SAC address several of DDPG's stability issues while retaining its core ideas.
Theoretical Basis
Deterministic Policy Gradient Theorem
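The key theoretical result (Silver et al., 2014) shows that the gradient of the expected return $J$ with respect to the policy parameters $\theta^\mu$ is:
$$\nabla_{\theta^\mu} J = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_a Q^\mu(s, a)\big|_{a=\mu(s)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\right]$$
where $\rho^\mu$ is the state distribution under policy $\mu$. This avoids integrating over the action space, requiring only the gradient of $Q$ with respect to the action, evaluated at the action chosen by the current policy.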
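
As a worked illustration of the chain rule in this theorem, consider a toy setup that is not from the source: a scalar linear policy $\mu(s) = w s$ with a single parameter $w$, and an assumed critic $Q(s, a) = s a - \tfrac{1}{2} a^2$ that is treated as fixed. The derivation below is a sketch under those assumptions.

```latex
% Toy setup (assumed for illustration): \mu(s) = w s, \quad Q(s, a) = s a - \tfrac{1}{2} a^2
\begin{aligned}
\nabla_a Q(s, a)\big|_{a = w s} &= s - w s = (1 - w)\, s \\
\nabla_w \mu(s) &= s \\
\nabla_w J &= \mathbb{E}_{s}\!\left[(1 - w)\, s \cdot s\right] = (1 - w)\, \mathbb{E}_{s}\!\left[s^2\right]
\end{aligned}
```

The gradient is positive for $w < 1$ and negative for $w > 1$, so gradient ascent moves $w$ toward 1. At $w = 1$ the policy selects $a = s$, which is the maximizer of this toy critic over $a$.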
Critic Update
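The critic is trained to minimize the mean squared Bellman error over a minibatch of $N$ transitions $(s_i, a_i, r_i, s'_i)$ sampled from the replay buffer:
$$L = \frac{1}{N} \sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2, \qquad y_i = r_i + \gamma \, Q'\!\left(s'_i, \mu'(s'_i)\right)$$
where $Q'$ and $\mu'$ are the target networks and $\gamma$ is the discount factor.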
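
The arithmetic of the critic update can be sketched in plain Python. This is a minimal illustration, not the tch-rs implementation: the function names `td_targets` and `critic_loss` and all numbers are hypothetical, and the network outputs are stood in for by hand-written lists.

```python
def td_targets(rewards, next_q_values, gamma):
    """TD target y = r + gamma * Q'(s', mu'(s')) for each transition."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(q_values, targets):
    """Mean squared Bellman error over the minibatch."""
    n = len(q_values)
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / n


# Hypothetical minibatch of three transitions.
rewards = [1.0, 0.0, -1.0]
next_q_values = [2.0, 4.0, 0.0]  # stand-in for Q'(s', mu'(s')) from the target networks
q_values = [2.5, 4.0, -1.0]      # stand-in for Q(s, a) from the online critic
gamma = 0.5

targets = td_targets(rewards, next_q_values, gamma)  # [2.0, 2.0, -1.0]
loss = critic_loss(q_values, targets)                # (0.25 + 4.0 + 0.0) / 3
```

In a real implementation the targets are treated as constants, so no gradient flows into the target networks when the loss is minimized.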
Actor Update
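The actor is updated by ascending the deterministic policy gradient, approximated using a minibatch of $N$ states:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s_i, a \mid \theta^Q)\big|_{a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s_i \mid \theta^\mu)$$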
In practice, this is computed by:
- Forward pass the state $s$ through the actor to get the action $a = \mu(s)$.
- Forward pass $(s, a)$ through the critic to get $Q(s, a)$.
- Backpropagate the loss $-Q(s, a)$, averaged over the minibatch, through both networks, but update only the actor's parameters.
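
The three steps above can be sketched with hand-written derivatives in place of automatic differentiation. Everything here is an assumption made for illustration: the one-parameter actor `a = w * s`, the fixed critic `Q(s, a) = -(a - 2s)^2`, the states, and the learning rate are all hypothetical.

```python
def actor(w, s):
    """Toy deterministic policy with a single parameter w."""
    return w * s


def dq_da(s, a):
    """Gradient of the assumed critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def actor_gradient(w, states):
    """Minibatch estimate: mean of dQ/da (at a = mu(s)) times dmu/dw (= s)."""
    grads = [dq_da(s, actor(w, s)) * s for s in states]
    return sum(grads) / len(grads)


states = [-1.0, 0.5, 1.0]  # hypothetical minibatch of states
w = 0.0                    # initial actor parameter
learning_rate = 0.1

for _ in range(200):
    # Gradient ascent on Q: only the actor parameter changes, the critic stays fixed.
    w += learning_rate * actor_gradient(w, states)
```

Under these assumptions `w` approaches 2, the value at which the actor outputs `a = 2s`, the action that maximizes the toy critic for every state.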
Soft Target Updates
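Target networks are updated using Polyak averaging with coefficient $\tau$ (typically $\tau \ll 1$):
$$\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$$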
This creates a slowly moving target that stabilizes the TD learning process, analogous to the periodically copied target network in DQN but with gradual rather than abrupt changes.
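
A minimal sketch of the soft update on flat parameter lists follows. The function name `soft_update` and the numbers are hypothetical, and `tau` is set much larger than is usual in practice so the effect is visible after one step.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_params, online_params)
    ]


online = [1.0, -2.0]  # hypothetical online-network parameters
target = [0.0, 0.0]   # hypothetical target-network parameters
tau = 0.1             # illustrative value, deliberately large

# One update moves each target parameter 10% of the way toward its online value.
target = soft_update(target, online, tau)
```

Repeating the update with the online parameters held fixed shrinks the gap between the two parameter sets by a factor of `1 - tau` per step.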
Ornstein-Uhlenbeck Noise Process
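The OU process is defined by the stochastic differential equation:
$$dx_t = \theta (\mu - x_t)\, dt + \sigma\, dW_t$$
where:
- $\theta$ is the mean reversion rate (how fast the noise returns to the mean)
- $\mu$ is the long-term mean (typically 0); this scalar is unrelated to the policy mu
- $\sigma$ is the volatility (noise magnitude)
- $W_t$ is a Wiener process (Brownian motion)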
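In discrete time with step size dt (Euler-Maruyama discretization):
x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)
With dt = 1 this reduces to x_{t+1} = x_t + theta * (mu - x_t) + sigma * N(0, 1).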
The exploration action is then the policy output plus the current noise value:
$$a_t = \mu(s_t) + x_t$$
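
The discrete-time noise process can be sketched with the Python standard library alone. The class name `OUNoise`, its constructor arguments, and the default values are assumptions chosen for illustration, not the tch-rs API; the step size `dt` is kept explicit so that `dt = 1.0` gives the simplified update.

```python
import math
import random


class OUNoise:
    """Scalar Ornstein-Uhlenbeck noise, discretized with step size dt."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=1.0, seed=None):
        self.theta = theta  # mean reversion rate
        self.mu = mu        # long-term mean
        self.sigma = sigma  # volatility
        self.dt = dt        # step size
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its long-term mean (called once per episode)."""
        self.x = self.mu

    def sample(self):
        """Advance the process by one step and return the new noise value."""
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


noise = OUNoise(seed=0)
policy_action = 0.3  # hypothetical actor output mu(s)
exploration_action = policy_action + noise.sample()
```

Because each sample starts from the previous value, consecutive samples are correlated, unlike independent Gaussian draws. In an environment with bounded actions the result would usually be clipped to the valid range before being applied.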
Experience Replay Buffer
The replay buffer stores the most recent M transitions as a circular buffer, overwriting the oldest entry once it is full:
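BUFFER of capacity M, position = 0, current_size = 0
FUNCTION store(s, a, r, s', done):
    buffer[position] = (s, a, r, s', done)
    position = (position + 1) mod M
    current_size = min(current_size + 1, M)
FUNCTION sample(batch_size):
    // uniform indices in [0, current_size), drawn with replacement
    indices = random_integers(0, current_size, batch_size)
    RETURN buffer[indices]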
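
The pseudocode above can be turned into a small runnable sketch. The class name `ReplayBuffer` and its fields are hypothetical and use plain Python lists rather than tensors; sampling is uniform with replacement, and the number of stored transitions is tracked explicitly so that empty slots are never sampled.

```python
import random


class ReplayBuffer:
    """Fixed-capacity circular buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.buffer = [None] * capacity
        self.position = 0  # index of the next slot to write
        self.size = 0      # number of valid transitions stored so far
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        """Write a transition, overwriting the oldest one once the buffer is full."""
        self.buffer[self.position] = (s, a, r, s_next, done)
        self.position = (self.position + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        """Draw batch_size transitions uniformly, with replacement."""
        indices = [self.rng.randrange(self.size) for _ in range(batch_size)]
        return [self.buffer[i] for i in indices]


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    # Hypothetical transitions: state i, action 0.0, reward 1.0, next state i + 1.
    buf.store(i, 0.0, 1.0, i + 1, False)

batch = buf.sample(2)
```

After five stores into a buffer of capacity three, only the transitions starting from states 2, 3, and 4 remain. Sampling from an empty buffer raises `ValueError` in this sketch, which is why a training loop waits until enough transitions have been stored.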
Complete DDPG Training Loop
INITIALIZE actor mu, critic Q with random weights
INITIALIZE target networks mu', Q' as copies
INITIALIZE replay buffer B
INITIALIZE OU noise process
FOR each episode:
    RESET environment, get initial state s
    RESET noise process
    FOR each step:
        a = mu(s) + noise.sample()
        s', r, done = environment.step(a)
        B.store(s, a, r, s', done)
        IF B.size >= batch_size:
            Sample minibatch from B
            Compute target y = r + gamma * Q'(s', mu'(s'))
            Update critic by minimizing (y - Q(s,a))^2
            Update actor using policy gradient
            Soft-update target networks
        s = s'
        IF done: BREAK