Principle:Sktime Pytorch forecasting Categorical Encoding
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Data_Engineering, Preprocessing |
| Last Updated | 2026-02-08 07:00 GMT |
Overview
Technique for converting categorical string or mixed-type variables into integer indices suitable for neural network embedding layers.
Description
Categorical Encoding maps each unique category in a variable to an integer index. Neural networks cannot process string values directly; they require integer indices that are then mapped to dense vector representations via embedding layers. The encoding must handle edge cases: NaN values (common in real-world time series), unknown categories at inference time (categories not seen during training), and mixed numeric/string types. A robust label encoder provides a consistent mapping that is fitted on training data and safely applied to validation/test data, mapping unknown categories to a special index (typically 0).
Usage
Use categorical encoding when preparing time series data with categorical features (e.g., product IDs, store names, day-of-week). In pytorch-forecasting, categorical encoders are either auto-fitted during TimeSeriesDataSet construction or pre-fitted and passed via the categorical_encoders parameter. Pre-fitting is required when the group ID column needs consistent encoding across training and validation datasets.
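To see why a pre-fitted encoder matters for group IDs, the sketch below compares fitting separately on each split with fitting once on the training data. It uses plain Python dictionaries rather than pytorch-forecasting objects; the helper `fit_mapping` and the store names are hypothetical and serve only as an illustration.

```python
def fit_mapping(values):
    """Hypothetical helper: index 0 is reserved for unknowns, known categories start at 1."""
    return {category: idx + 1 for idx, category in enumerate(sorted(set(values)))}


train_groups = ["store_a", "store_b", "store_c"]
val_groups = ["store_b", "store_c"]  # the validation split contains fewer stores

# Fitting on each split independently yields different indices for the same store.
train_mapping = fit_mapping(train_groups)
val_mapping = fit_mapping(val_groups)
inconsistent = train_mapping["store_b"] != val_mapping["store_b"]

# Fitting once on the training data and reusing the mapping keeps indices aligned.
shared_mapping = fit_mapping(train_groups)
val_codes = [shared_mapping.get(g, 0) for g in val_groups]
```

In pytorch-forecasting terms, the shared mapping plays the role of the pre-fitted encoder passed via the categorical_encoders parameter to both datasets.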
Theoretical Basis
Label encoding:
$$
\text{encode}(c) = \begin{cases} \text{mapping}[c] & \text{if } c \in \text{known\_classes} \\ 0 & \text{if } c \notin \text{known\_classes} \text{ or } c = \text{NaN} \end{cases}
$$
With NaN handling:
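# Abstract encoding pipeline
# Inputs: unique_categories (categories seen in training), data (values to encode),
#         add_nan (whether to reserve index 0 for NaN / unknown values)
mapping = {}
if add_nan:
    mapping["nan"] = 0  # placeholder key for NaN; index 0 is reserved
for idx, category in enumerate(sorted(unique_categories)):
    mapping[category] = idx + (1 if add_nan else 0)

# Transform: map categories to integers.
# The fallback to 0 is only collision-free when add_nan is True;
# otherwise index 0 already belongs to the first known category.
encoded = [mapping.get(c, 0) for c in data]  # unknown -> 0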
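The pipeline above can be wrapped in a small fit/transform class. The following is a minimal sketch, not the pytorch-forecasting implementation: the class name `SimpleLabelEncoder`, the `_key` helper, and the choice to normalize every value to a string (so that mixed numeric/string columns sort consistently) are assumptions made for illustration.

```python
import math


class SimpleLabelEncoder:
    """Hypothetical label encoder: index 0 is reserved for NaN and unknown values."""

    NAN_KEY = "nan"  # assumed placeholder key for missing values

    def __init__(self, add_nan=True):
        self.add_nan = add_nan
        self.mapping = {}

    @classmethod
    def _key(cls, value):
        # Normalize mixed numeric/string values to strings; treat None/NaN as missing.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return cls.NAN_KEY
        return str(value)

    def fit(self, values):
        # Collect the distinct non-missing categories seen in the training data.
        keys = {self._key(v) for v in values} - {self.NAN_KEY}
        offset = 1 if self.add_nan else 0
        self.mapping = {self.NAN_KEY: 0} if self.add_nan else {}
        for idx, key in enumerate(sorted(keys)):
            self.mapping[key] = idx + offset
        return self

    def transform(self, values):
        if self.add_nan:
            # Unknown and missing values fall back to the reserved index 0.
            return [self.mapping.get(self._key(v), 0) for v in values]
        # Without a reserved index, an unknown value raises KeyError instead of
        # silently colliding with the first real category.
        return [self.mapping[self._key(v)] for v in values]


encoder = SimpleLabelEncoder(add_nan=True).fit(["store_b", "store_a", 7, float("nan")])
train_codes = encoder.transform(["store_a", "store_b", 7])
test_codes = encoder.transform(["store_c", float("nan"), "store_a"])
```

Here the encoder is fitted once on training values and then reused on unseen data: the unseen category "store_c" and the NaN value both land on index 0, while "store_a" keeps the index it received during fitting. One limitation of this sketch is that a genuine category spelled "nan" would share the reserved key.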
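Embedding lookup: The integer indices feed into an embedding layer:
$$
\mathbf{e}_c = E[\text{encode}(c)]
$$
where $E \in \mathbb{R}^{K \times d}$ is the learned embedding matrix, $K$ is the number of integer indices, and $d$ is the embedding dimension.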
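As a rough illustration of the lookup, the sketch below builds a small embedding matrix with plain Python lists and selects one row per encoded index. The sizes, the random initialization, and the name `embed` are assumptions; in a real model the matrix would be a trainable parameter (for example a `torch.nn.Embedding` layer in PyTorch).

```python
import random

random.seed(0)

num_indices = 4    # assumed: 3 known categories plus the reserved index 0
embedding_dim = 3  # assumed embedding size

# Embedding matrix E with one row per integer index (randomly initialized here;
# during training these values would be learned).
E = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(num_indices)
]


def embed(indices):
    """Look up the embedding row for each encoded category index."""
    return [E[i] for i in indices]


encoded = [2, 0, 3, 2]    # output of a label encoder; 0 marks NaN/unknown
vectors = embed(encoded)  # one vector of length embedding_dim per input index
```

Identical indices always retrieve the same row, so every occurrence of a category shares one vector, and all unknown or missing values share the row at index 0.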