
Principle:Gretelai Gretel synthetics Timeseries Data Preparation

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Time_Series, GAN
Last Updated 2026-02-14 19:00 GMT

Overview

Timeseries Data Preparation is the process of converting raw user-provided time series data (NumPy arrays or pandas DataFrames) into the normalized, encoded internal representation required by the DoppelGANger GAN for training.

Description

Before a time series GAN can train, the input data must pass through several transformation stages:

Type Detection: When explicit type annotations are not provided, the system automatically determines whether each variable is continuous or discrete. Float and integer columns are treated as continuous, while string columns are treated as discrete. Users may override this behavior by providing explicit OutputType lists or by specifying discrete columns when using the DataFrame interface.
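The dtype-based inference described above can be sketched as follows. This is an illustrative helper, not the library's actual function; `infer_output_type` is a hypothetical name.

```python
import numpy as np

# Hypothetical helper illustrating the inference rule described above:
# numeric dtypes map to continuous, everything else (e.g. strings) to discrete.
def infer_output_type(column: np.ndarray) -> str:
    if np.issubdtype(column.dtype, np.floating) or np.issubdtype(
        column.dtype, np.integer
    ):
        return "continuous"
    return "discrete"

temps = np.array([20.5, 21.0, 19.8])    # float column -> continuous
states = np.array(["up", "down", "up"])  # string column -> discrete
```

In the real library, passing explicit OutputType lists (or naming discrete columns in the DataFrame interface) bypasses this inference entirely.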

Output Metadata Creation: Based on the detected types and the data values, Output metadata objects are created for each variable. Continuous variables receive ContinuousOutput instances (fitted with global min/max), while discrete variables receive either OneHotEncodedOutput or BinaryEncodedOutput instances depending on the number of unique values relative to the binary_encoder_cutoff threshold. These metadata objects store the encoding parameters needed for both forward and inverse transformations.

NaN Handling: For continuous features, the system validates examples by checking the ratio of NaN values and the maximum count of consecutive NaNs. Examples exceeding the thresholds are marked invalid and excluded from training. Valid examples with remaining NaNs are repaired via linear interpolation.
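A minimal sketch of this validate-then-repair logic, assuming illustrative threshold values (the library exposes its own configuration; the helper names here are hypothetical):

```python
import numpy as np

# Illustrative thresholds, not the library's defaults.
MAX_NAN_RATIO = 0.1
MAX_CONSECUTIVE_NANS = 5

def is_valid_example(seq: np.ndarray) -> bool:
    """Reject a sequence with too many NaNs overall or too long a NaN run."""
    nan_mask = np.isnan(seq)
    if nan_mask.mean() > MAX_NAN_RATIO:
        return False
    longest = run = 0
    for is_nan in nan_mask:
        run = run + 1 if is_nan else 0
        longest = max(longest, run)
    return longest <= MAX_CONSECUTIVE_NANS

def repair(seq: np.ndarray) -> np.ndarray:
    """Linearly interpolate the remaining NaNs in a valid sequence."""
    out = seq.copy()
    nan_mask = np.isnan(out)
    idx = np.arange(len(out))
    out[nan_mask] = np.interp(idx[nan_mask], idx[~nan_mask], out[~nan_mask])
    return out
```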

Feature Transformation: Continuous features are scaled to [0,1] or [-1,1] based on global min/max. When per-example scaling is enabled, additional attributes (midpoint and half-range) are computed for each continuous feature per example, and the features are further rescaled within each example's range. Discrete features are one-hot or binary encoded.

Attribute Transformation: Attributes undergo the same encoding as features (continuous scaling or discrete encoding) but without per-example scaling since attributes have only one value per example.

Padding: Variable-length sequences are padded to max_sequence_len with zeros in the internal representation.

Tensor Conversion: The final numpy arrays are wrapped in a PyTorch TensorDataset containing three tensors: attributes, additional attributes (midpoint/half-range), and features. When attributes or additional attributes are absent, nan-filled placeholder tensors are used.
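The placeholder behavior can be sketched as below. To keep the sketch dependency-free it returns NumPy arrays; the library itself wraps the results in a PyTorch TensorDataset, and `assemble_training_arrays` is a hypothetical name.

```python
import numpy as np

def assemble_training_arrays(attributes, additional_attributes, features):
    """Assemble the three per-example arrays described above, substituting
    NaN-filled placeholders when attributes or additional attributes are
    absent (placeholder shapes here are illustrative)."""
    n_examples = features.shape[0]
    if attributes is None:
        attributes = np.full((n_examples, 1), np.nan)
    if additional_attributes is None:
        additional_attributes = np.full((n_examples, 1), np.nan)
    return attributes, additional_attributes, features
```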

Usage

Data preparation runs automatically when calling train_numpy() or train_dataframe(). The first call also triggers model building (network construction). Subsequent calls reuse the existing Output metadata. Use train_numpy() when data is already in numpy array form, and train_dataframe() when working with pandas DataFrames in either "wide" (one row per example) or "long" (one row per time point) format.
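The two DataFrame layouts can be illustrated with a toy example. Column names here (`example_id`, `time`, `value`) are illustrative, not names the library requires:

```python
import pandas as pd

# "Wide" format: one row per example, one column per time point.
wide = pd.DataFrame({"t0": [1.0, 4.0], "t1": [2.0, 5.0], "t2": [3.0, 6.0]})

# "Long" format: one row per time point, with an example id column.
long = (
    wide.reset_index()
    .rename(columns={"index": "example_id"})
    .melt(id_vars="example_id", var_name="time", value_name="value")
    .sort_values(["example_id", "time"])
    .reset_index(drop=True)
)
```

Both frames describe the same two three-step sequences; the long format additionally accommodates variable-length examples, since each example can contribute a different number of rows.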

Theoretical Basis

The data preparation pipeline implements the preprocessing requirements of the DoppelGANger architecture:

Normalization: Continuous variables must be mapped to a bounded range matching the generator output activation. For sigmoid (ZERO_ONE):

x_scaled = (x - x_min) / max(x_max - x_min, 1e-6)

For tanh (MINUSONE_ONE):

x_scaled = 2 * (x - x_min) / max(x_max - x_min, 1e-6) - 1
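A small worked example of both formulas (the `scale` helper and mode strings are illustrative, not library API):

```python
import numpy as np

def scale(x, x_min, x_max, mode="zero_one"):
    """Min-max scale with the epsilon guard from the formulas above."""
    denom = max(x_max - x_min, 1e-6)
    scaled = (x - x_min) / denom
    return scaled if mode == "zero_one" else 2.0 * scaled - 1.0

x = np.array([10.0, 15.0, 20.0])
scale(x, 10.0, 20.0)                  # sigmoid range: [0.0, 0.5, 1.0]
scale(x, 10.0, 20.0, "minusone_one")  # tanh range:    [-1.0, 0.0, 1.0]
```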

Per-example scaling: When time series have highly variable ranges across examples (e.g., network traffic from dial-up vs fiber connections), per-example scaling computes:

midpoint = (example_min + example_max) / 2

half_range = (example_max - example_min) / 2

These are generated as additional attributes, allowing the model to learn the distribution of ranges while normalizing each example independently.
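A sketch of per-example scaling using the two formulas above. It rescales to [-1, 1] for concreteness; the library's choice of target range follows the configured output activation, and `per_example_scale` is a hypothetical name:

```python
import numpy as np

def per_example_scale(example: np.ndarray):
    """Compute the midpoint/half-range additional attributes and rescale
    the example within its own range (epsilon guards a constant series)."""
    lo, hi = example.min(), example.max()
    midpoint = (lo + hi) / 2.0
    half_range = (hi - lo) / 2.0
    scaled = (example - midpoint) / max(half_range, 1e-6)
    return scaled, midpoint, half_range
```

For a dial-up trace of [100, 200, 300] and a fiber trace of [1e6, 2e6, 3e6], both scale to the same [-1, 0, 1] shape, while their very different midpoints and half-ranges survive as attributes the model can learn.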

Discrete encoding: Discrete variables are converted to continuous representations. One-hot encoding creates a vector of length equal to the number of unique values. Binary encoding creates a vector of length ceil(log2(num_unique)), which is more memory-efficient for high-cardinality columns.
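The memory savings are easy to quantify. The helpers below are illustrative (the `max(1, ...)` guard for single-value columns is an assumption, not documented library behavior):

```python
import math

def one_hot_len(num_unique: int) -> int:
    # One-hot: one dimension per unique value.
    return num_unique

def binary_len(num_unique: int) -> int:
    # Binary: ceil(log2(num_unique)) dimensions; guard the 1-value case.
    return max(1, math.ceil(math.log2(num_unique)))

one_hot_len(1000)  # a 1000-value column needs 1000 one-hot dimensions
binary_len(1000)   # but only 10 binary dimensions (2**10 = 1024 >= 1000)
```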

Variable-length sequence handling: Shorter sequences are zero-padded to max_sequence_len. In the DataFrame "long" format, a generation flag feature is appended to mark the end of the original sequence.
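The padding-plus-flag idea can be sketched as below; the library's actual flag encoding may differ in detail, and `pad_sequence` is a hypothetical name:

```python
import numpy as np

def pad_sequence(seq: np.ndarray, max_sequence_len: int) -> np.ndarray:
    """Zero-pad a (time, features) sequence and append a generation-flag
    column: 1 within the original sequence, 0 in the padding, so the
    model can learn where each sequence really ends."""
    t, f = seq.shape
    padded = np.zeros((max_sequence_len, f))
    padded[:t] = seq
    flag = np.zeros((max_sequence_len, 1))
    flag[:t] = 1.0
    return np.concatenate([padded, flag], axis=1)
```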

Related Pages

Implemented By
