Principle: Sktime / PyTorch Forecasting Validation Dataset Creation
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Data_Engineering, Model_Evaluation |
| Last Updated | 2026-02-08 07:00 GMT |
Overview
Technique for creating validation datasets that share the same variable encoders, normalizers, and configuration as the training dataset but use held-out time periods.
Description
Validation Dataset Creation ensures consistency between training and evaluation data processing. In time series forecasting, validation data must use the exact same categorical encoders (label mappings), continuous variable scalers (mean/std or quantile parameters), and variable type configurations as the training set. This prevents data leakage and ensures that the model sees properly encoded inputs during evaluation. The from_dataset pattern clones all configuration from a source (training) dataset and applies it to new data, optionally disabling randomization and restricting to prediction-only samples.
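The cloning mechanism described above can be sketched in plain Python. This is a minimal illustration of the idea, not the pytorch-forecasting implementation; the names `DatasetConfig` and `from_dataset` are hypothetical stand-ins for the library's `TimeSeriesDataSet` and its `from_dataset` classmethod.

```python
from dataclasses import dataclass, replace

@dataclass
class DatasetConfig:
    # fitted artifacts that must be shared between train and validation
    category_encoding: dict        # e.g. {"A": 0, "B": 1, "C": 2}
    norm_mean: float               # fitted on training data only
    norm_std: float                # fitted on training data only
    stop_randomization: bool = False

def from_dataset(source_config, **overrides):
    # clone every fitted parameter; allow behavioral overrides only
    return replace(source_config, **overrides)

train_cfg = DatasetConfig({"A": 0, "B": 1, "C": 2}, norm_mean=5.0, norm_std=2.0)
val_cfg = from_dataset(train_cfg, stop_randomization=True)

# encoders and normalizer parameters are carried over unchanged
assert val_cfg.category_encoding == train_cfg.category_encoding
assert val_cfg.norm_mean == train_cfg.norm_mean
```

The key design point is that only behavioral flags (like randomization) may be overridden; fitted statistics are always inherited from the source dataset.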
Usage
Use this principle after constructing a training TimeSeriesDataSet. The validation dataset should cover a time range that extends beyond the training cutoff to provide out-of-sample evaluation. This is a required step in all forecasting workflows before training, as both training and validation DataLoaders must be provided to the Trainer.
Theoretical Basis
Time series cross-validation differs from standard i.i.d. splitting:
Temporal split rule: Data is split by time, not randomly. All validation samples have timestamps strictly after the training cutoff.
Encoder consistency: For a categorical variable with classes {A, B, C} seen in training, the validation set must use the same integer encoding (A→0, B→1, C→2). New classes in validation should map to a special unknown token.
Normalizer consistency: If training data is normalized with mean=μ and std=σ, validation data must use the same μ and σ — not recompute them from validation data.
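The three rules above can be demonstrated together on a toy panel using only the standard library. This is an illustrative sketch; the cutoff value, column layout, and unknown-token scheme are assumptions for the example, not library behavior.

```python
import statistics

# toy panel of (time_idx, category, value); training cutoff: time_idx <= 3
rows = [(1, "A", 4.0), (2, "B", 6.0), (3, "A", 5.0),
        (4, "C", 9.0), (5, "D", 7.0)]       # "C", "D" are unseen in training
train = [r for r in rows if r[0] <= 3]       # temporal split, not random
val   = [r for r in rows if r[0] > 3]        # strictly after the cutoff

# encoder fitted on training classes only; unseen classes map to an unknown token
classes = sorted({c for _, c, _ in train})
encoding = {c: i for i, c in enumerate(classes)}
UNKNOWN = len(encoding)

def encode(c):
    return encoding.get(c, UNKNOWN)

# normalizer fitted on training values only, then reused for validation
mu = statistics.mean(v for _, _, v in train)
sigma = statistics.pstdev(v for _, _, v in train)
val_encoded = [(t, encode(c), (v - mu) / sigma) for t, c, v in val]
```

Note that μ and σ come exclusively from the training rows; recomputing them on the validation rows would leak evaluation-period statistics into preprocessing.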
Pseudo-code:
# Abstract validation dataset creation (mirrors TimeSeriesDataSet.from_dataset)
val_dataset = from_dataset(
    training_dataset,         # source of encoders, normalizers, and config
    full_dataframe,           # includes both train and val periods
    stop_randomization=True,  # deterministic sampling for eval
    predict=True,             # one prediction sample per group; False for all windows
)