Principle: Gretel.ai gretel-synthetics WGAN-GP Training
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Time_Series, GAN |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
WGAN-GP Training is an adversarial training procedure that uses the Wasserstein distance with gradient penalty to iteratively optimize a generator and discriminator, enabling stable training of GANs for time series synthesis.
Description
The Wasserstein GAN with Gradient Penalty (WGAN-GP) is the training algorithm used in DoppelGANger. Unlike standard GANs that optimize a Jensen-Shannon divergence-based objective, WGAN-GP optimizes an approximation of the Wasserstein (Earth Mover's) distance between the real and generated data distributions. The gradient penalty term replaces weight clipping to enforce the 1-Lipschitz constraint on the discriminator (critic).
The DoppelGANger training loop has the following structure per batch:
Step 1 - Generate Fake Data: Sample attribute noise and feature noise, pass through the Generator to produce a batch of fake (attributes, additional_attributes, features).
Step 2 - Train Feature Discriminator: For each discriminator round:
- Compute discriminator output on both generated and real batches by concatenating attributes, additional attributes, and flattened features.
- Calculate the Wasserstein loss: L = E[D(fake)] - E[D(real)]
- Compute the gradient penalty GP on interpolated data between real and fake samples.
- Total discriminator loss: L_D = L + lambda_gp * GP
- Update discriminator weights via Adam optimizer.
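The critic update above can be sketched in PyTorch as follows. This is a minimal illustration, not the library's internals: `discriminator_step` and its arguments are hypothetical names, and the real loop concatenates attributes, additional attributes, and flattened features into the critic input.

```python
import torch
from torch import nn

def discriminator_step(disc, opt, real, fake, lambda_gp=10.0):
    """One illustrative WGAN-GP critic update.

    `fake` is assumed to be already detached from the generator, so only
    the critic's weights receive gradients here.
    """
    opt.zero_grad()
    # Wasserstein loss: L = E[D(fake)] - E[D(real)]
    loss_w = disc(fake).mean() - disc(real).mean()
    # Gradient penalty on random interpolates between real and fake.
    alpha = torch.rand(real.size(0), 1)
    x_hat = (real + alpha * (fake - real)).requires_grad_(True)
    grad = torch.autograd.grad(disc(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()
    # Total critic loss: L_D = L + lambda_gp * GP
    loss = loss_w + lambda_gp * gp
    loss.backward()
    opt.step()
    return loss.item()
```

A toy usage with a two-layer critic on 4-dimensional inputs: build `disc = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))`, an Adam optimizer with `betas=(0.5, 0.999)`, and call `discriminator_step` on two random batches.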
Step 3 - Train Attribute Discriminator: If enabled, for each discriminator round:
- Compute attribute discriminator output on attributes and additional attributes only (features excluded).
- Calculate the same Wasserstein loss plus gradient penalty formulation.
- Update attribute discriminator weights via Adam optimizer.
Step 4 - Train Generator: For each generator round:
- Compute both discriminators' outputs on the generated batch.
- Generator loss: L_G = -E[D(fake)] + attribute_loss_coef * (-E[D_attr(fake_attr)])
- Update generator weights via Adam optimizer.
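The generator update can be sketched similarly. The names below are illustrative assumptions: a hypothetical `gen` returning an (attributes, features) pair, and the feature critic receiving a simple concatenation rather than the library's exact input layout.

```python
import torch
from torch import nn

def generator_step(gen, feat_disc, attr_disc, opt, noise,
                   attribute_loss_coef=1.0):
    """One illustrative generator update against both critics."""
    opt.zero_grad()
    fake_attr, fake_feat = gen(noise)
    fake = torch.cat([fake_attr, fake_feat], dim=1)
    # L_G = -E[D(fake)] + attribute_loss_coef * (-E[D_attr(fake_attr)])
    loss = (-feat_disc(fake).mean()
            + attribute_loss_coef * (-attr_disc(fake_attr).mean()))
    loss.backward()
    opt.step()
    return loss.item()
```

Both critic terms enter with a negative sign because the generator tries to maximize the critics' scores on generated data.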
Gradient Penalty Computation: For each element in the batch, a random interpolation factor alpha is sampled uniformly. Interpolated data is constructed as x_hat = real + alpha * (fake - real). The discriminator gradient with respect to x_hat is computed, and the penalty is the squared deviation of the gradient norm from 1:
GP = E[(||grad D(x_hat)||_2 - 1)^2]
A small epsilon (1e-8) is added inside the norm calculation for numerical stability.
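A standalone sketch of this computation, assuming a critic that maps a batch to per-sample scores (`gradient_penalty` is an illustrative helper, not the library's function):

```python
import torch

def gradient_penalty(disc, real, fake, eps=1e-8):
    """GP = E[(||grad_{x_hat} D(x_hat)||_2 - 1)^2], as described above.

    A small eps inside the square root keeps the norm numerically stable
    when the gradient is close to zero.
    """
    # One uniform interpolation factor per batch element, broadcast over
    # the remaining dimensions.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (real + alpha * (fake - real)).requires_grad_(True)
    grad = torch.autograd.grad(disc(x_hat).sum(), x_hat, create_graph=True)[0]
    norm = torch.sqrt((grad ** 2).flatten(1).sum(dim=1) + eps)
    return ((norm - 1) ** 2).mean()
```

As a sanity check: for the linear critic D(x) = 2x on scalar inputs, the gradient norm is 2 everywhere, so the penalty is (2 - 1)^2 = 1 regardless of the interpolation points.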
Mixed Precision: The training loop supports automatic mixed precision via torch.cuda.amp, using a GradScaler to manage float16/float32 conversions for reduced memory and faster computation.
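A reduced sketch of how such a step is typically wrapped, using the `torch.cuda.amp` primitives the text names; the function name is illustrative and the gradient penalty is omitted here for brevity:

```python
import torch
from torch import nn

def amp_discriminator_step(disc, opt, real, fake, scaler):
    """Illustrative critic update under automatic mixed precision.

    autocast runs the forward pass in float16 on CUDA; GradScaler scales
    the loss so small float16 gradients do not underflow before the
    optimizer step. With CUDA unavailable both become no-ops.
    """
    opt.zero_grad()
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        loss = disc(fake).mean() - disc(real).mean()
    scaler.scale(loss).backward()
    scaler.step(opt)   # unscales gradients, skips the step on inf/nan
    scaler.update()    # adjusts the scale factor for the next batch
    return loss.item()
```

The scaler is created once, e.g. `scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())`, and reused across batches so its scale factor can adapt.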
Usage
The training loop executes automatically when train_numpy() or train_dataframe() is called, after data preparation and model building. The user controls training behavior through DGANConfig parameters: epochs, batch_size, discriminator_rounds, generator_rounds, learning rates, beta1 values, gradient penalty coefficients, and attribute_loss_coef. An optional progress callback receives ProgressInfo objects after each batch.
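A hedged sketch of how these knobs are typically passed. The parameters listed in the text (epochs, batch_size, discriminator_rounds, generator_rounds, attribute_loss_coef) are taken from the source; the import paths and the shape parameters `max_sequence_len` and `sample_len` are recalled from the library and should be verified against the installed gretel-synthetics version.

```python
# Hypothetical usage sketch; verify names against your installed version.
from gretel_synthetics.timeseries_dgan.config import DGANConfig
from gretel_synthetics.timeseries_dgan.dgan import DGAN

config = DGANConfig(
    max_sequence_len=20,     # assumed shape parameters for the sequences
    sample_len=5,
    epochs=100,
    batch_size=1024,
    discriminator_rounds=1,  # critic updates per batch
    generator_rounds=1,
    attribute_loss_coef=1.0, # alpha in the generator loss above
)
model = DGAN(config)
# model.train_numpy(...)  # runs the per-batch loop described above
```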
Theoretical Basis
Wasserstein Distance: The Wasserstein-1 distance (Earth Mover's distance) between distributions P_r and P_g is:
W(P_r, P_g) = sup_{||f||_L <= 1} E_{x ~ P_r}[f(x)] - E_{x ~ P_g}[f(x)]
where the supremum is over all 1-Lipschitz functions f. The discriminator D approximates this optimal f.
Gradient Penalty (Gulrajani et al., 2017): Instead of clipping discriminator weights, the gradient penalty enforces the Lipschitz constraint softly:
GP = E_{x_hat ~ P_{x_hat}} [(||nabla_{x_hat} D(x_hat)||_2 - 1)^2]
where x_hat = epsilon * x_real + (1 - epsilon) * x_fake with epsilon ~ U[0,1].
Feature Discriminator Loss:
L_D = E[D(G(z))] - E[D(x)] + lambda * GP
Attribute Discriminator Loss:
L_{D_attr} = E[D_attr(G(z)_attr)] - E[D_attr(x_attr)] + lambda_attr * GP_attr
Generator Loss:
L_G = -E[D(G(z))] + alpha * (-E[D_attr(G(z)_attr)])
where alpha is the attribute_loss_coef.
Adam Optimizer: All three networks use Adam with a reduced beta1 (default 0.5 instead of the typical 0.9), following standard GAN practice: the lower first-moment momentum helps with the non-stationary optimization of adversarial training.
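In PyTorch this is a one-line change to the optimizer construction; the learning rate shown is an arbitrary illustration, not the library's default:

```python
import torch
from torch import nn

# Adam with beta1 lowered to 0.5 for GAN training, as described above.
net = nn.Linear(10, 1)
opt = torch.optim.Adam(net.parameters(), lr=1e-4, betas=(0.5, 0.999))
```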