Principle: Sktime PyTorch Forecasting Transformer Encoder Architecture
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning, Transformer_Architecture |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
Stacked transformer encoder that combines self-attention, cross-attention, layer normalization, and position-wise feed-forward networks to produce contextualized sequence representations for time series forecasting.
Description
The Transformer Encoder Architecture consists of two components: an Encoder container, which stacks multiple EncoderLayer blocks and applies optional final normalization and projection, and the EncoderLayer itself, which defines the computation within each block.
Each EncoderLayer performs three main operations in sequence:
1. Self-Attention with Residual Connection: The input is passed through a self-attention mechanism (queries, keys, and values all come from the same input), followed by dropout. The result is added to the original input (residual connection) and layer-normalized.
2. Cross-Attention on a Global Token: A distinctive feature of this encoder (designed for the TimeXer architecture) is that cross-attention is applied only to a global token extracted from the last position of the self-attention output. This global token attends to the external (exogenous) cross input, gathering cross-variable information. The attended global token is added back via a residual connection, layer-normalized, and then concatenated back with the remaining patch tokens (see the shape sketch after this list).
3. Position-Wise Feed-Forward Network: The combined representation passes through a two-layer feed-forward network implemented with 1D convolutions (kernel size 1), an activation function (ReLU or GELU), and dropout. A final residual connection and layer normalization complete the block.
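To make the global-token handling in step 2 concrete, here is a small shape sketch; the batch size, token counts, and dimensions are illustrative assumptions, not values taken from the library:

```python
import torch

B, N, D = 32, 13, 128            # batch, 12 patch tokens + 1 global token, d_model (illustrative)
x = torch.randn(B, N, D)         # self-attention output within one layer
cross = torch.randn(B, 7, D)     # embedded exogenous (cross) tokens; length chosen arbitrarily

x_glb = x[:, -1:, :]             # global token kept as a length-1 sequence, shape (B, 1, D)
patches = x[:, :-1, :]           # remaining patch tokens, shape (B, N-1, D)

# only x_glb attends to cross; afterwards it is concatenated back with the patches
x = torch.cat([patches, x_glb], dim=1)   # shape (B, N, D) again, ready for the feed-forward step
```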
The Encoder wrapper iterates over all layers sequentially, optionally applying a final normalization layer and a projection layer at the end.
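A minimal sketch of that container logic, assuming constructor arguments named `attn_layers`, `norm_layer`, and `projection`; this illustrates the described behavior and is not the library's exact class:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the container: stacks EncoderLayer blocks, then optional norm and projection."""

    def __init__(self, attn_layers, norm_layer=None, projection=None):
        super().__init__()
        self.layers = nn.ModuleList(attn_layers)
        self.norm = norm_layer          # e.g. nn.LayerNorm(d_model), or None
        self.projection = projection    # e.g. nn.Linear(d_model, out_dim), or None

    def forward(self, x, cross):
        # x: endogenous patch tokens + global token; cross: exogenous tokens
        for layer in self.layers:
            x = layer(x, cross)
        if self.norm is not None:
            x = self.norm(x)
        if self.projection is not None:
            x = self.projection(x)
        return x
```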
Usage
Use the Encoder and EncoderLayer for building the encoder stack of transformer-based time series forecasting models like TimeXer. The cross input to the forward pass carries exogenous variable information, while the primary input x carries the endogenous patch tokens plus the global token. Set d_ff (feed-forward dimension) to 4 times d_model as a standard default. Choose activation="relu" or activation="gelu" depending on the model configuration.
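As an illustration of those defaults, the following wiring sketch uses the Encoder sketch above and the EncoderLayer sketch given after the pseudo-code below; the hyperparameter values and tensor shapes are assumptions for demonstration, not library defaults:

```python
import torch
import torch.nn as nn

d_model, n_heads, e_layers = 128, 8, 2       # illustrative values
# d_ff defaults to 4 * d_model inside the EncoderLayer sketch, matching the standard default

layers = [EncoderLayer(d_model=d_model, n_heads=n_heads, activation="gelu")  # or "relu"
          for _ in range(e_layers)]
encoder = Encoder(layers, norm_layer=nn.LayerNorm(d_model))

x = torch.randn(32, 13, d_model)      # endogenous patch tokens + global token
cross = torch.randn(32, 7, d_model)   # exogenous (cross) tokens
out = encoder(x, cross)               # shape (32, 13, d_model)
```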
Theoretical Basis
Standard Transformer Encoder Layer (self-attention with residual connection and layer normalization):

$$x' = \mathrm{LayerNorm}_1\big(x + \mathrm{Dropout}(\mathrm{SelfAttn}(x, x, x))\big)$$

Cross-Attention on Global Token (TimeXer variant):

$$x_{\text{glb}}' = \mathrm{LayerNorm}_2\big(x_{\text{glb}} + \mathrm{Dropout}(\mathrm{CrossAttn}(x_{\text{glb}}, c, c))\big)$$

Where $x_{\text{glb}} = x'_{[:,\,-1,\,:]}$ is the global token extracted from the last position of the self-attention output and $c$ is the exogenous cross input. The updated global token is then concatenated back with the patch tokens: $\hat{x} = \mathrm{Concat}\big(x'_{[:,\,:-1,\,:]},\, x_{\text{glb}}'\big)$.

Position-Wise Feed-Forward Network:

$$\mathrm{FFN}(\hat{x}) = W_2\,\sigma(W_1 \hat{x} + b_1) + b_2$$

Where $\sigma$ is the activation function (ReLU or GELU) and $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$.

Full Layer Output:

$$y = \mathrm{LayerNorm}_3\big(\hat{x} + \mathrm{Dropout}(\mathrm{FFN}(\hat{x}))\big)$$
Pseudo-code for one encoder layer:
```python
# EncoderLayer forward pass (pseudo-code)
def encoder_layer(x, cross):
    # 1. Self-attention over all tokens, residual connection + layer norm
    x = layer_norm_1(x + dropout(self_attention(x, x, x)))

    # 2. Cross-attention on the global token (last position, kept as a length-1 sequence)
    x_glb = x[:, -1:, :]                        # shape (B, 1, d_model)
    x_glb = layer_norm_2(x_glb + dropout(cross_attention(x_glb, cross, cross)))

    # 3. Reassemble the token sequence and apply the position-wise feed-forward network
    x = concat([x[:, :-1, :], x_glb], dim=1)    # shape (B, N, d_model)
    y = dropout(activation(conv1d_up(x)))       # kernel-size-1 conv over the feature dimension
    y = dropout(conv1d_down(y))
    return layer_norm_3(x + y)
```
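For reference, a runnable PyTorch sketch of one encoder layer follows. It uses torch.nn.MultiheadAttention as a stand-in for the library's attention wrappers, so module names, signatures, and defaults here are assumptions rather than the actual sktime / pytorch-forecasting implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderLayer(nn.Module):
    """Illustrative sketch of one encoder block (not the library's implementation)."""

    def __init__(self, d_model, n_heads, d_ff=None, dropout=0.1, activation="relu"):
        super().__init__()
        d_ff = d_ff or 4 * d_model   # standard default: feed-forward dim = 4 x d_model
        self.self_attention = nn.MultiheadAttention(d_model, n_heads,
                                                    dropout=dropout, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads,
                                                     dropout=dropout, batch_first=True)
        # position-wise feed-forward network as two kernel-size-1 convolutions
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross):
        # 1. self-attention over all tokens, residual connection + layer norm
        attn_out, _ = self.self_attention(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))

        # 2. only the global token (last position) cross-attends to the exogenous tokens
        x_glb = x[:, -1:, :]                                   # (B, 1, d_model)
        glb_out, _ = self.cross_attention(x_glb, cross, cross, need_weights=False)
        x_glb = self.norm2(x_glb + self.dropout(glb_out))
        x = torch.cat([x[:, :-1, :], x_glb], dim=1)            # reassemble the token sequence

        # 3. position-wise feed-forward (Conv1d expects channels first), residual + norm
        y = self.dropout(self.activation(self.conv1(x.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        return self.norm3(x + y)
```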