Principle:Sktime Pytorch forecasting Positional Encoding

Knowledge Sources	pytorch-forecasting, Attention Is All You Need, TimeXer, iTransformer
Domains	Time_Series, Forecasting, Deep_Learning, Embedding
Last Updated	2026-02-08 09:00 GMT

Overview

Position-aware embedding strategies for transformer-based time series models: sinusoidal positional encoding for absolute position information, patch-based encoder embedding with a learnable global token, and inverted (variable-as-token) data embedding for exogenous variables.

Description

This principle covers three complementary embedding approaches used to prepare inputs for transformer-based forecasting models:

1. Sinusoidal Positional Embedding (PositionalEmbedding): Injects absolute position information into the representation using fixed (non-learnable) sinusoidal functions. Even-indexed dimensions use sine and odd-indexed dimensions use cosine, with frequencies decreasing geometrically across dimensions. The encoding is pre-computed for a maximum sequence length and stored as a non-trainable buffer, so the model can distinguish positions in the sequence without any learned parameters. Because the table is fixed, it applies unchanged to any sequence length up to the pre-computed maximum, including lengths not seen during training.

2. Patch-Based Encoder Embedding (EnEmbedding): Designed for endogenous (target) variable embedding in the TimeXer architecture. The input time series is first permuted to a channel-first layout, then segmented into non-overlapping patches of fixed length using an unfold operation. Each patch is linearly projected to the model dimension. Sinusoidal positional encoding is added to convey the ordering of patches. A learnable global token is appended to the patch sequence for each variable; this token serves as an aggregation point that later participates in cross-attention to gather exogenous information. The output is reshaped so that all variables are processed as independent samples (channel independence).

3. Inverted Data Embedding (DataEmbedding_inverted): Embeds exogenous variables by treating each variable (channel) as a separate token whose feature vector spans the time dimension. The input is transposed from (Batch, Time, Channels) to (Batch, Channels, Time), and each channel-time vector is linearly projected to the model dimension. If time-stamp marks are available, they are concatenated with the variable channels before projection. This inverted perspective allows the transformer to capture inter-variable dependencies directly.

Usage

Use PositionalEmbedding whenever sequence order must be encoded in transformer inputs; it is used internally by EnEmbedding. Use EnEmbedding for the endogenous encoder path in TimeXer, configuring patch_len to control the granularity of temporal segmentation. Use DataEmbedding_inverted for embedding exogenous or cross-variable features in iTransformer-style and TimeXer architectures, passing optional x_mark timestamp features to enrich the representation.

Theoretical Basis

Sinusoidal Positional Encoding:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$

Where $pos$ is the position index and $i$ indexes the sine/cosine dimension pair, $i = 0, \dots, d_{model}/2 - 1$. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
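The two formulas above can be sketched in plain Python to make the index layout concrete. This is only an illustration: the function name `sinusoidal_positional_encoding` and the even-`d_model` restriction are assumptions of this sketch, and the library itself stores the table as a PyTorch buffer rather than a nested list.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            # Dimensions 2i and 2i+1 share one frequency.
            angle = pos / (10000 ** (2 * i / d_model))
            table[pos][2 * i] = math.sin(angle)      # even dimension: sine
            table[pos][2 * i + 1] = math.cos(angle)  # odd dimension: cosine
    return table

# Pre-compute once for the maximum length, then slice for shorter sequences.
pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
pe_for_short_sequence = pe[:12]
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check on the even/odd layout.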

Patch-Based Encoding:

Given input $x \in \mathbb{R}^{B \times T \times C}$ and patch length $P$:

$N_{patches} = \lfloor T / P \rfloor$

$x_{patches} \in \mathbb{R}^{(B \cdot C) \times N_{patches} \times P}$

$e = W_{val} \cdot x_{patches} + PE$

A global token $g \in \mathbb{R}^{1 \times C \times 1 \times d_{model}}$ is appended:

$e_{full} = \mathrm{Concat}(e, g)$
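The shape bookkeeping above can be checked with a small dependency-free sketch. The helper names `split_into_patches` and `patch_shapes` are hypothetical and exist only for this illustration; the numeric sizes are arbitrary examples, not values taken from the library.

```python
def split_into_patches(series, patch_len):
    """Split one channel's series into non-overlapping patches.

    Time steps that do not fill a whole patch are dropped, matching
    N_patches = floor(T / P).
    """
    n_patches = len(series) // patch_len
    return [series[k * patch_len:(k + 1) * patch_len] for k in range(n_patches)]

def patch_shapes(batch, time_steps, channels, patch_len, d_model):
    """Return the tensor shapes at each stage of the patch-based embedding."""
    n_patches = time_steps // patch_len
    return {
        "x_patches": (batch * channels, n_patches, patch_len),
        "e": (batch * channels, n_patches, d_model),
        # One global token per variable is appended along the patch axis.
        "e_full": (batch * channels, n_patches + 1, d_model),
    }

patches = split_into_patches(list(range(10)), patch_len=4)
shapes = patch_shapes(batch=32, time_steps=96, channels=7, patch_len=16, d_model=128)
```

With T = 10 and P = 4 only two full patches fit, so the last two time steps are discarded; choosing a patch length that divides T avoids losing data.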

Inverted (Variable-as-Token) Embedding:

$x' = x^{\top} \in \mathbb{R}^{B \times C \times T}$

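$e = x' W_{val} + b$

Where $W_{val} \in \mathbb{R}^{T \times d_{model}}$ projects each channel's length-$T$ time-series vector to the model dimension and $b \in \mathbb{R}^{d_{model}}$ is the bias, so that $e \in \mathbb{R}^{B \times C \times d_{model}}$ holds one token per variable.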
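A small worked example may help show what "one token per variable" means. The sketch below handles a single sample in plain Python; the function name `inverted_embedding` and the toy weights are assumptions made for illustration, and timestamp marks are left out for brevity.

```python
def inverted_embedding(x, weight, bias):
    """Embed each channel's whole time series as one token.

    x:      T x C nested list (one sample, time-major)
    weight: T x d_model projection matrix
    bias:   list of length d_model
    Returns C tokens, each of length d_model.
    """
    time_steps, channels = len(x), len(x[0])
    d_model = len(bias)
    tokens = []
    for c in range(channels):
        # Row c of the transposed input: the full history of one variable.
        channel_series = [x[t][c] for t in range(time_steps)]
        token = [
            sum(channel_series[t] * weight[t][j] for t in range(time_steps)) + bias[j]
            for j in range(d_model)
        ]
        tokens.append(token)
    return tokens

x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]      # T = 3, C = 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # T = 3, d_model = 2
bias = [0.5, -0.5]
tokens = inverted_embedding(x, weight, bias)     # [[4.5, 4.5], [40.5, 49.5]]
```

The number of tokens equals the number of variables rather than the number of time steps, which is why attention over these tokens relates variables to one another.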

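Sketch of the patch-based embedding forward pass in PyTorch (the layers and the global token are passed in as arguments here purely for illustration):

# EnEmbedding forward pass (sketch)
import torch

def en_embedding(x, patch_len, linear_value, positional_encoding, global_token, dropout):
    B, T, C = x.shape                                            # input is (B, T, C)
    x = x.permute(0, 2, 1)                                       # (B, C, T)
    x = x.unfold(dimension=-1, size=patch_len, step=patch_len)   # (B, C, N_patches, P)
    n_patches = x.shape[2]
    x = x.reshape(B * C, n_patches, patch_len)                   # each variable becomes a sample
    # positional_encoding(x) is assumed to return (1, N_patches, d_model) and broadcast
    x = linear_value(x) + positional_encoding(x)                 # (B*C, N_patches, d_model)
    x = x.reshape(B, C, n_patches, -1)                           # (B, C, N_patches, d_model)
    glb = global_token.repeat(B, 1, 1, 1)                        # (1, C, 1, d_model) -> (B, C, 1, d_model)
    x = torch.cat([x, glb], dim=2)                               # append learnable token per variable
    x = x.reshape(B * C, n_patches + 1, -1)                      # (B*C, N_patches+1, d_model)
    return dropout(x)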

