Principle: Gretel.ai gretel-synthetics WGAN-GP Training
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Time_Series, GAN |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
WGAN-GP Training is an adversarial training procedure that uses the Wasserstein distance with gradient penalty to iteratively optimize a generator and discriminator, enabling stable training of GANs for time series synthesis.
Description
The Wasserstein GAN with Gradient Penalty (WGAN-GP) is the training algorithm used in DoppelGANger. Unlike standard GANs that optimize a Jensen-Shannon divergence-based objective, WGAN-GP optimizes an approximation of the Wasserstein (Earth Mover's) distance between the real and generated data distributions. The gradient penalty term replaces weight clipping to enforce the 1-Lipschitz constraint on the discriminator (critic).
The DoppelGANger training loop has the following structure per batch:
Step 1 - Generate Fake Data: Sample attribute noise and feature noise, pass through the Generator to produce a batch of fake (attributes, additional_attributes, features).
Step 2 - Train Feature Discriminator: For each discriminator round:
- Compute discriminator output on both generated and real batches by concatenating attributes, additional attributes, and flattened features.
- Calculate the Wasserstein loss: L = E[D(fake)] - E[D(real)]
- Compute the gradient penalty GP on interpolated data between real and fake samples.
- Total discriminator loss: L_D = L + lambda_gp * GP
- Update discriminator weights via Adam optimizer.
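The critic update above can be sketched in PyTorch as follows. This is a minimal illustration, not the library's internals: `discriminator_step` and its arguments are hypothetical names, and the real loop concatenates attributes, additional attributes, and flattened features into the critic input.

```python
import torch
from torch import nn

def discriminator_step(disc, opt, real, fake, lambda_gp=10.0):
    """One illustrative WGAN-GP critic update.

    `fake` is assumed to be already detached from the generator, so only
    the critic's weights receive gradients here.
    """
    opt.zero_grad()
    # Wasserstein loss: L = E[D(fake)] - E[D(real)]
    loss_w = disc(fake).mean() - disc(real).mean()
    # Gradient penalty on random interpolates between real and fake.
    alpha = torch.rand(real.size(0), 1)
    x_hat = (real + alpha * (fake - real)).requires_grad_(True)
    grad = torch.autograd.grad(disc(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()
    # Total critic loss: L_D = L + lambda_gp * GP
    loss = loss_w + lambda_gp * gp
    loss.backward()
    opt.step()
    return loss.item()
```

A toy usage with a two-layer critic on 4-dimensional inputs: build `disc = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))`, an Adam optimizer with `betas=(0.5, 0.999)`, and call `discriminator_step` on two random batches.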
Step 3 - Train Attribute Discriminator: If enabled, for each discriminator round:
- Compute attribute discriminator output on attributes and additional attributes only (features excluded).
- Calculate the same Wasserstein loss plus gradient penalty formulation.
- Update attribute discriminator weights via Adam optimizer.
Step 4 - Train Generator: For each generator round:
- Compute both discriminators' outputs on the generated batch.
- Generator loss: L_G = -E[D(fake)] + attribute_loss_coef * (-E[D_attr(fake_attr)])
- Update generator weights via Adam optimizer.
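The generator update can be sketched similarly. The names below are illustrative assumptions: a hypothetical `gen` returning an (attributes, features) pair, and the feature critic receiving a simple concatenation rather than the library's exact input layout.

```python
import torch
from torch import nn

def generator_step(gen, feat_disc, attr_disc, opt, noise,
                   attribute_loss_coef=1.0):
    """One illustrative generator update against both critics."""
    opt.zero_grad()
    fake_attr, fake_feat = gen(noise)
    fake = torch.cat([fake_attr, fake_feat], dim=1)
    # L_G = -E[D(fake)] + attribute_loss_coef * (-E[D_attr(fake_attr)])
    loss = (-feat_disc(fake).mean()
            + attribute_loss_coef * (-attr_disc(fake_attr).mean()))
    loss.backward()
    opt.step()
    return loss.item()
```

Both critic terms enter with a negative sign because the generator tries to maximize the critics' scores on generated data.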
Gradient Penalty Computation: For each element in the batch, a random interpolation factor alpha is sampled uniformly. Interpolated data is constructed as x_hat = real + alpha * (fake - real). The discriminator gradient with respect to x_hat is computed, and the penalty is the squared deviation of the gradient norm from 1:
GP = E[(||grad D(x_hat)||_2 - 1)^2]
A small epsilon (1e-8) is added inside the norm calculation for numerical stability.
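A standalone sketch of this computation, assuming a critic that maps a batch to per-sample scores (`gradient_penalty` is an illustrative helper, not the library's function):

```python
import torch

def gradient_penalty(disc, real, fake, eps=1e-8):
    """GP = E[(||grad_{x_hat} D(x_hat)||_2 - 1)^2], as described above.

    A small eps inside the square root keeps the norm numerically stable
    when the gradient is close to zero.
    """
    # One uniform interpolation factor per batch element, broadcast over
    # the remaining dimensions.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (real + alpha * (fake - real)).requires_grad_(True)
    grad = torch.autograd.grad(disc(x_hat).sum(), x_hat, create_graph=True)[0]
    norm = torch.sqrt((grad ** 2).flatten(1).sum(dim=1) + eps)
    return ((norm - 1) ** 2).mean()
```

As a sanity check: for the linear critic D(x) = 2x on scalar inputs, the gradient norm is 2 everywhere, so the penalty is (2 - 1)^2 = 1 regardless of the interpolation points.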
Mixed Precision: The training loop supports automatic mixed precision via torch.cuda.amp, using a GradScaler to manage float16/float32 conversions for reduced memory and faster computation.
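A reduced sketch of how such a step is typically wrapped, using the `torch.cuda.amp` primitives the text names; the function name is illustrative and the gradient penalty is omitted here for brevity:

```python
import torch
from torch import nn

def amp_discriminator_step(disc, opt, real, fake, scaler):
    """Illustrative critic update under automatic mixed precision.

    autocast runs the forward pass in float16 on CUDA; GradScaler scales
    the loss so small float16 gradients do not underflow before the
    optimizer step. With CUDA unavailable both become no-ops.
    """
    opt.zero_grad()
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        loss = disc(fake).mean() - disc(real).mean()
    scaler.scale(loss).backward()
    scaler.step(opt)   # unscales gradients, skips the step on inf/nan
    scaler.update()    # adjusts the scale factor for the next batch
    return loss.item()
```

The scaler is created once, e.g. `scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())`, and reused across batches so its scale factor can adapt.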
Usage
The training loop executes automatically when train_numpy() or train_dataframe() is called, after data preparation and model building. The user controls training behavior through DGANConfig parameters: epochs, batch_size, discriminator_rounds, generator_rounds, learning rates, beta1 values, gradient penalty coefficients, and attribute_loss_coef. An optional progress callback receives ProgressInfo objects after each batch.
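A hedged sketch of how these knobs are typically passed. The parameters listed in the text (epochs, batch_size, discriminator_rounds, generator_rounds, attribute_loss_coef) are taken from the source; the import paths and the shape parameters `max_sequence_len` and `sample_len` are recalled from the library and should be verified against the installed gretel-synthetics version.

```python
# Hypothetical usage sketch; verify names against your installed version.
from gretel_synthetics.timeseries_dgan.config import DGANConfig
from gretel_synthetics.timeseries_dgan.dgan import DGAN

config = DGANConfig(
    max_sequence_len=20,     # assumed shape parameters for the sequences
    sample_len=5,
    epochs=100,
    batch_size=1024,
    discriminator_rounds=1,  # critic updates per batch
    generator_rounds=1,
    attribute_loss_coef=1.0, # alpha in the generator loss above
)
model = DGAN(config)
# model.train_numpy(...)  # runs the per-batch loop described above
```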
Theoretical Basis
Wasserstein Distance: The Wasserstein-1 distance (Earth Mover's distance) between distributions P_r and P_g is:
W(P_r, P_g) = sup_{||f||_L <= 1} E_{x ~ P_r}[f(x)] - E_{x ~ P_g}[f(x)]
where the supremum is over all 1-Lipschitz functions f. The discriminator D approximates this optimal f.
Gradient Penalty (Gulrajani et al., 2017): Instead of clipping discriminator weights, the gradient penalty enforces the Lipschitz constraint softly:
GP = E_{x_hat ~ P_{x_hat}} [(||nabla_{x_hat} D(x_hat)||_2 - 1)^2]
where x_hat = epsilon * x_real + (1 - epsilon) * x_fake with epsilon ~ U[0,1].
Feature Discriminator Loss:
L_D = E[D(G(z))] - E[D(x)] + lambda * GP
Attribute Discriminator Loss:
L_{D_attr} = E[D_attr(G(z)_attr)] - E[D_attr(x_attr)] + lambda_attr * GP_attr
Generator Loss:
L_G = -E[D(G(z))] + alpha * (-E[D_attr(G(z)_attr)])
where alpha is the attribute_loss_coef.
Adam Optimizer: All three networks use Adam with a reduced beta1 (default 0.5 instead of the typical 0.9), following standard GAN practice: the lower first-moment momentum helps with the non-stationary optimization of adversarial training.
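In PyTorch this is a one-line change to the optimizer construction; the learning rate shown is an arbitrary illustration, not the library's default:

```python
import torch
from torch import nn

# Adam with beta1 lowered to 0.5 for GAN training, as described above.
net = nn.Linear(10, 1)
opt = torch.optim.Adam(net.parameters(), lr=1e-4, betas=(0.5, 0.999))
```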