Principle:Sktime Pytorch forecasting V2 Data Pipeline

Knowledge Sources	pytorch-forecasting
Domains	Time_Series, Forecasting, Deep_Learning, Data_Engineering
Last Updated	2026-02-08 09:00 GMT

Overview

The V2 data pipeline architecture in pytorch-forecasting provides a two-layer system for ingesting raw time series data and preparing batched tensor inputs for deep learning models. It comprises three core components: the TimeSeries dataset (D1 raw data layer), the TslibDataModule (D2 batching layer for tslib-style models), and the EncoderDecoderTimeSeriesDataModule (D2 batching layer for encoder-decoder models).

Description

The V2 data pipeline is an experimental rework of the original pytorch-forecasting data layer, designed to be more modular and to accommodate multiple model families. It is organized as a two-layer architecture:

Layer D1 -- TimeSeries Dataset: The TimeSeries class is a PyTorch Dataset that ingests a pandas DataFrame and exposes individual time series as dictionaries of tensors. It handles grouping by identifier columns, splitting features into categorical and continuous types, marking features as known or unknown for future time steps, and recording static covariates. Each call to __getitem__ returns a dictionary containing time indices (t), target values (y), feature matrix (x), group identifiers (group), static features (st), and a cutoff time. It also supports an optional data_future DataFrame to merge known future covariates with historical data. The metadata it exposes (via get_metadata) provides column names, type mappings (F for float, C for categorical), and known/unknown annotations (K/U) that downstream D2 modules rely on.
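
The per-series dictionary layout described above can be illustrated without the library. The following is a minimal sketch in plain Python, not the TimeSeries implementation: the helper `group_series`, the sample column names, and the `cutoff_time` key (taken here to be the last observed time step) are assumptions made for illustration, and nested lists stand in for tensors.

```python
from collections import defaultdict


def group_series(rows, time, target, group, features, static):
    """Group flat records into one dict per series (hypothetical helper)."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group]].append(row)

    samples = []
    for key, records in by_group.items():
        records = sorted(records, key=lambda r: r[time])
        samples.append({
            "t": [r[time] for r in records],                        # time indices
            "y": [[r[target]] for r in records],                    # target values
            "x": [[r[f] for f in features] for r in records],       # feature matrix
            "group": key,                                           # group identifier
            "st": [records[0][s] for s in static],                  # static features
            "cutoff_time": records[-1][time],                       # assumption: last observed time
        })
    return samples


rows = [
    {"store": "a", "day": 1, "sales": 10.0, "price": 2.0, "region": 0},
    {"store": "b", "day": 0, "sales": 5.0, "price": 3.0, "region": 1},
    {"store": "a", "day": 0, "sales": 8.0, "price": 2.5, "region": 0},
]
samples = group_series(
    rows, time="day", target="sales", group="store",
    features=["price"], static=["region"],
)
```

Each element of `samples` plays the role of one `__getitem__` result: one series, sorted by time, with static values recorded once rather than per time step.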

Layer D2 -- TslibDataModule: The TslibDataModule is a Lightning DataModule tailored for tslib-style transformer models (e.g., Informer, AutoFormer, TimeXer). It consumes a TimeSeries dataset and creates sliding windows of fixed context_length and prediction_length. Its _TslibDataset inner class produces dict-based batches with keys such as history_cont, history_cat, future_cont, future_cat, history_mask, future_mask, and history_target/future_target. The module automatically classifies the feature mode as S (single), MS (multivariate-to-single), or M (multivariate) based on the number of targets and covariates. It prepares comprehensive metadata including feature names, indices, and counts that models use to configure their layer dimensions.

Layer D2 -- EncoderDecoderTimeSeriesDataModule: The EncoderDecoderTimeSeriesDataModule is a Lightning DataModule designed for classical encoder-decoder architectures (e.g., TFT, DeepAR). It calls the two window segments encoder and decoder, where TslibDataModule calls them history and future. Its inner _ProcessedEncoderDecoderDataset yields batches with keys like encoder_cat, encoder_cont, decoder_cat, decoder_cont, encoder_lengths, decoder_lengths, and target_past. It computes the target scale from the encoder window for normalization and distinguishes known from unknown features in the decoder portion. Its metadata reports feature counts for encoder categorical, encoder continuous, decoder categorical (known only), decoder continuous (known only), static features, and sequence length parameters.
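
The known-only decoder inputs can be sketched as follows. This is an illustrative helper under stated assumptions, not the module's code: `split_encoder_decoder` is a hypothetical name, the window is a list of per-time-step feature rows, and `col_known` is a list of K/U flags aligned with the feature columns.

```python
def split_encoder_decoder(x, col_known, context_length, prediction_length):
    """Split one window into encoder rows and known-only decoder rows.

    x         -- list of rows, one list of feature values per time step
    col_known -- list of "K"/"U" flags, one per feature column
    """
    if len(x) != context_length + prediction_length:
        raise ValueError("window length must equal context + prediction length")

    # Columns flagged "K" are the only ones available in the future.
    known_idx = [i for i, flag in enumerate(col_known) if flag == "K"]

    encoder = x[:context_length]                        # all features
    decoder = [[row[i] for i in known_idx]              # known features only
               for row in x[context_length:]]
    return encoder, decoder


window = [[1, 10], [2, 20], [3, 30], [4, 40]]
encoder, decoder = split_encoder_decoder(
    window, col_known=["K", "U"], context_length=3, prediction_length=1
)
```

Here the second column is unknown in the future, so it appears in every encoder row but is dropped from the decoder row.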

Both D2 modules share common patterns: sliding window creation with configurable stride, random train/validation/test splitting, custom collate functions that stack variable-count tensors into batches, and lazy metadata computation via a cached property.
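
The random train/validation/test split over windows can be sketched in a few lines. This is a minimal illustration, assuming the split is applied to window indices; the function name, the default fractions, and the seed handling are hypothetical and not taken from the library.

```python
import random


def random_split(n_windows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle window indices and cut them into three disjoint subsets."""
    indices = list(range(n_windows))
    random.Random(seed).shuffle(indices)    # local RNG keeps the split reproducible

    n_train = int(train_frac * n_windows)
    n_val = int(val_frac * n_windows)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]        # remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = random_split(100)
```

Note that a purely random split lets validation windows overlap in time with training windows, which is worth keeping in mind when interpreting validation metrics.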

Usage

Use the V2 data pipeline when building models with the experimental pytorch-forecasting v2 interface. Choose TslibDataModule for transformer-based tslib architectures that expect flat dict inputs with history_cont/future_cont keys. Choose EncoderDecoderTimeSeriesDataModule for encoder-decoder models that expect encoder_cat/decoder_cat style keys. In both cases, first construct a TimeSeries dataset from a pandas DataFrame specifying target, group, time, and covariate columns, then pass it to the appropriate data module with the desired context and prediction lengths.

Theoretical Basis

Sliding Window Decomposition:

Given a time series of length $T$, the pipeline generates training samples by sliding a window of total length $L_{c} + L_{p}$ across the series, where $L_{c}$ is the context (encoder) length and $L_{p}$ is the prediction (decoder) length:

$\mathrm{Window}_{i} = \{x_{i}, x_{i + 1}, \dots, x_{i + L_{c} + L_{p} - 1}\}, \quad i = 0, s, 2s, \dots$

where $s$ is the window stride.
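
As a sketch of the decomposition, the start indices can be enumerated directly. The helper name is hypothetical, and it assumes that only windows fitting entirely inside the series are kept, i.e. $i + L_{c} + L_{p} \le T$.

```python
def window_starts(series_length, context_length, prediction_length, stride=1):
    """Return the start index i of every full window in a series of length T."""
    window = context_length + prediction_length
    last_start = series_length - window     # largest i with i + window <= T
    # range() is empty when the series is shorter than one window.
    return list(range(0, last_start + 1, stride))


starts = window_starts(series_length=10, context_length=4,
                       prediction_length=2, stride=2)
too_short = window_starts(series_length=5, context_length=4,
                          prediction_length=2)
```

Under that assumption a series yields $\lfloor (T - L_{c} - L_{p}) / s \rfloor + 1$ windows when $T \ge L_{c} + L_{p}$, and none otherwise.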

Feature Taxonomy (V2):

Continuous features (col_type = F) -- Numeric features fed directly to the model
Categorical features (col_type = C) -- Integer-encoded features typically passed through embedding layers
Known features (col_known = K) -- Available in both encoder and decoder windows
Unknown features (col_known = U) -- Available only in the encoder (history) window
Static features -- Time-invariant covariates broadcast across all time steps
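
The taxonomy above determines which columns reach which part of a model. The sketch below routes columns by their type and known/unknown flags; the function name, the example column names, and the dict-of-flags input shape are assumptions for illustration, and the output keys simply mirror the encoder/decoder naming used on this page.

```python
def partition_features(col_type, col_known):
    """Route columns to encoder/decoder inputs from F/C and K/U flags."""
    groups = {
        "encoder_cont": [], "encoder_cat": [],
        "decoder_cont": [], "decoder_cat": [],
    }
    for name, ftype in col_type.items():
        kind = "cont" if ftype == "F" else "cat"
        groups[f"encoder_{kind}"].append(name)      # every feature has a history
        if col_known.get(name) == "K":
            groups[f"decoder_{kind}"].append(name)  # only known features have a future
    return groups


col_type = {"price": "F", "holiday": "C", "demand_lag": "F"}
col_known = {"price": "K", "holiday": "K", "demand_lag": "U"}
groups = partition_features(col_type, col_known)
```

The lengths of the resulting lists correspond to the feature counts that the D2 metadata reports for sizing model layers.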

Feature Mode Detection:

# Automatic feature mode classification (illustrative helper; the name is hypothetical)
def detect_feature_mode(n_targets: int, n_covariates: int) -> str:
    if n_targets == 1 and n_covariates == 0:
        return "S"    # Single-variable forecasting
    if n_targets == 1 and n_covariates >= 1:
        return "MS"   # Multivariate input, single target output
    if n_targets > 1:
        return "M"    # Fully multivariate forecasting
    raise ValueError("at least one target is required")

Batch Structure (TslibDataModule):

# Each batch item is a tuple (x_dict, y_tensor)
x = {
    "history_cont":   Tensor(context_length, n_cont_features),
    "history_cat":    Tensor(context_length, n_cat_features),
    "future_cont":    Tensor(prediction_length, n_known_cont),
    "future_cat":     Tensor(prediction_length, n_known_cat),
    "history_mask":   BoolTensor(context_length),
    "future_mask":    BoolTensor(prediction_length),
    "history_target": Tensor(context_length, n_targets),
    "future_target":  Tensor(prediction_length, n_targets),
}
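
A collate step turns a list of such `(x_dict, y)` items into one batch with a leading batch dimension. The following is a minimal sketch only: nested lists stand in for tensors, the function name is hypothetical, and a real implementation would stack tensors (for example with `torch.stack`) rather than build lists.

```python
def collate(samples):
    """Combine (x_dict, y) samples into a batch-first (x_batch, y_batch) pair."""
    keys = samples[0][0].keys()
    # For each key, gather that entry from every sample: the new outer
    # list is the batch dimension.
    x_batch = {key: [x[key] for x, _ in samples] for key in keys}
    y_batch = [y for _, y in samples]
    return x_batch, y_batch


sample_a = ({"history_cont": [[1.0], [2.0]], "future_cont": [[3.0]]}, [[30.0]])
sample_b = ({"history_cont": [[4.0], [5.0]], "future_cont": [[6.0]]}, [[60.0]])
x_batch, y_batch = collate([sample_a, sample_b])
```

After collation, `x_batch["history_cont"]` has the layout (batch_size, context_length, n_cont_features), matching the per-item shapes listed above with one extra leading dimension.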
