
Principle:Sktime Pytorch forecasting Validation Dataset Creation

From Leeroopedia


Knowledge Sources
Domains Time_Series, Data_Engineering, Model_Evaluation
Last Updated 2026-02-08 07:00 GMT

Overview

Technique for creating validation datasets that share the same variable encoders, normalizers, and configuration as the training dataset but use held-out time periods.

Description

Validation Dataset Creation ensures consistency between training and evaluation data processing. In time series forecasting, validation data must use the exact same categorical encoders (label mappings), continuous variable scalers (mean/std or quantile parameters), and variable type configurations as the training set. This prevents data leakage and ensures that the model sees properly encoded inputs during evaluation. The from_dataset pattern clones all configuration from a source (training) dataset and applies it to new data, optionally disabling randomization and restricting to prediction-only samples.
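The clone-from-source pattern described above can be sketched in plain Python. All names here are illustrative, not the pytorch-forecasting API; the point is only that fitting happens once, on the training data, and the validation dataset receives the fitted configuration instead of refitting.

```python
class SimpleTimeSeriesDataset:
    """Illustrative dataset that fits encoders once and can clone them."""

    def __init__(self, rows, categorical_encoding=None, norm_params=None):
        self.rows = rows
        # Fit configuration only when it is not supplied by a source dataset.
        if categorical_encoding is None:
            cats = sorted({row["group"] for row in rows})
            categorical_encoding = {c: i for i, c in enumerate(cats)}
        if norm_params is None:
            values = [row["value"] for row in rows]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            norm_params = (mean, var ** 0.5)
        self.categorical_encoding = categorical_encoding
        self.norm_params = norm_params

    @classmethod
    def from_dataset(cls, source, rows):
        # Reuse the source's encoders/normalizers instead of refitting.
        return cls(
            rows,
            categorical_encoding=source.categorical_encoding,
            norm_params=source.norm_params,
        )
```

The key design choice is that fitting only happens when no configuration is supplied, so a validation clone built via `from_dataset` never recomputes statistics from validation-period data.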

Usage

Use this principle after constructing a training TimeSeriesDataSet. The validation dataset should cover a time range that extends beyond the training cutoff to provide out-of-sample evaluation. This is a required step in all forecasting workflows before training, as both training and validation DataLoaders must be provided to the Trainer.
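The cutoff arithmetic can be sketched with pandas on a small synthetic frame (column names `group`, `time_idx`, and `value` are illustrative): train on rows up to the cutoff, but hand the full frame to the validation dataset so each validation sample can draw its encoder history from before the cutoff while its prediction window lies after it.

```python
import pandas as pd

# Synthetic long-format data: two groups, ten consecutive time steps each.
df = pd.DataFrame({
    "group":    ["A"] * 10 + ["B"] * 10,
    "time_idx": list(range(10)) * 2,
    "value":    [float(i) for i in range(20)],
})

max_prediction_length = 3
# Hold out the last `max_prediction_length` steps from training ...
training_cutoff = df["time_idx"].max() - max_prediction_length
train_df = df[df["time_idx"] <= training_cutoff]
# ... but keep `df` intact: the validation dataset is built on the FULL
# frame, restricted to samples whose prediction window extends beyond
# the training cutoff.
```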

Theoretical Basis

Time series cross-validation differs from standard i.i.d. splitting:

Temporal split rule: Data is split by time, not randomly. All validation samples have timestamps strictly after the training cutoff.

Encoder consistency: For a categorical variable with classes {A, B, C} seen in training, the validation set must use the same integer encoding (A→0, B→1, C→2). New classes in validation should map to a special unknown token.

Normalizer consistency: If training data is normalized with mean=μ and std=σ, validation data must use the same μ and σ — not recompute them from validation data.
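The two consistency rules can be demonstrated directly (helper names are illustrative): fit the categorical mapping and the mean/std on training data only, then reuse them on validation data, mapping unseen classes to a special unknown token.

```python
UNKNOWN = -1  # special token for classes unseen during training

def fit_categorical_encoder(values):
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode(values, mapping):
    # Unseen classes map to the UNKNOWN token instead of a fresh integer.
    return [mapping.get(v, UNKNOWN) for v in values]

def fit_normalizer(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, std

def normalize(values, mean, std):
    return [(v - mean) / std for v in values]

# Fit on TRAIN data only ...
mapping = fit_categorical_encoder(["A", "B", "C"])
mean, std = fit_normalizer([1.0, 2.0, 3.0])

# ... then reuse on validation: same integer codes, same mean and std.
val_codes = encode(["B", "D"], mapping)   # "D" was never seen in training
val_scaled = normalize([4.0], mean, std)  # scaled with TRAIN mean/std
```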

In pytorch-forecasting, this is a single call:

# Clone all encoders, normalizers, and variable configuration from training
val_dataset = TimeSeriesDataSet.from_dataset(
    training_dataset,
    full_dataframe,           # includes both train and val periods
    stop_randomization=True,  # deterministic sampling for eval
    predict=False,            # True yields one prediction sample per group
)

Related Pages

Implemented By
