Principle: Sktime PyTorch Forecasting Transformer Encoder Architecture
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning, Transformer_Architecture |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
Stacked transformer encoder that combines self-attention, cross-attention, layer normalization, and position-wise feed-forward networks to produce contextualized sequence representations for time series forecasting.
Description
The Transformer Encoder Architecture consists of two components: an Encoder container, which stacks multiple EncoderLayer blocks and applies optional final normalization and projection, and the EncoderLayer itself, which defines the computation within each block.
Each EncoderLayer performs three main operations in sequence:
1. Self-Attention with Residual Connection: The input is passed through a self-attention mechanism (queries, keys, and values all come from the same input), followed by dropout. The result is added to the original input (residual connection) and layer-normalized.
2. Cross-Attention on a Global Token: A distinctive feature of this encoder (designed for the TimeXer architecture) is that cross-attention is applied only to a global token extracted from the last position of the self-attention output. This global token attends to the external (exogenous) cross input, gathering cross-variable information. The attended global token is added back via a residual connection, layer-normalized, and then concatenated back with the remaining patch tokens (see the shape sketch after this list).
3. Position-Wise Feed-Forward Network: The combined representation passes through a two-layer feed-forward network implemented with 1D convolutions (kernel size 1), an activation function (ReLU or GELU), and dropout. A final residual connection and layer normalization complete the block.
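To make the global-token handling in step 2 concrete, here is a small shape sketch; the batch size, token counts, and dimensions are illustrative assumptions, not values taken from the library:

```python
import torch

B, N, D = 32, 13, 128            # batch, 12 patch tokens + 1 global token, d_model (illustrative)
x = torch.randn(B, N, D)         # self-attention output within one layer
cross = torch.randn(B, 7, D)     # embedded exogenous (cross) tokens; length chosen arbitrarily

x_glb = x[:, -1:, :]             # global token kept as a length-1 sequence, shape (B, 1, D)
patches = x[:, :-1, :]           # remaining patch tokens, shape (B, N-1, D)

# only x_glb attends to cross; afterwards it is concatenated back with the patches
x = torch.cat([patches, x_glb], dim=1)   # shape (B, N, D) again, ready for the feed-forward step
```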
The Encoder wrapper iterates over all layers sequentially, optionally applying a final normalization layer and a projection layer at the end.
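A minimal sketch of that container logic, assuming constructor arguments named `attn_layers`, `norm_layer`, and `projection`; this illustrates the described behavior and is not the library's exact class:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the container: stacks EncoderLayer blocks, then optional norm and projection."""

    def __init__(self, attn_layers, norm_layer=None, projection=None):
        super().__init__()
        self.layers = nn.ModuleList(attn_layers)
        self.norm = norm_layer          # e.g. nn.LayerNorm(d_model), or None
        self.projection = projection    # e.g. nn.Linear(d_model, out_dim), or None

    def forward(self, x, cross):
        # x: endogenous patch tokens + global token; cross: exogenous tokens
        for layer in self.layers:
            x = layer(x, cross)
        if self.norm is not None:
            x = self.norm(x)
        if self.projection is not None:
            x = self.projection(x)
        return x
```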
Usage
Use the Encoder and EncoderLayer for building the encoder stack of transformer-based time series forecasting models like TimeXer. The cross input to the forward pass carries exogenous variable information, while the primary input x carries the endogenous patch tokens plus the global token. Set d_ff (feed-forward dimension) to 4 times d_model as a standard default. Choose activation="relu" or activation="gelu" depending on the model configuration.
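As an illustration of those defaults, the following wiring sketch uses the Encoder sketch above and the EncoderLayer sketch given after the pseudo-code below; the hyperparameter values and tensor shapes are assumptions for demonstration, not library defaults:

```python
import torch
import torch.nn as nn

d_model, n_heads, e_layers = 128, 8, 2       # illustrative values
# d_ff defaults to 4 * d_model inside the EncoderLayer sketch, matching the standard default

layers = [EncoderLayer(d_model=d_model, n_heads=n_heads, activation="gelu")  # or "relu"
          for _ in range(e_layers)]
encoder = Encoder(layers, norm_layer=nn.LayerNorm(d_model))

x = torch.randn(32, 13, d_model)      # endogenous patch tokens + global token
cross = torch.randn(32, 7, d_model)   # exogenous (cross) tokens
out = encoder(x, cross)               # shape (32, 13, d_model)
```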
Theoretical Basis
Standard Transformer Encoder Layer (self-attention with residual connection and layer normalization):

$$x' = \mathrm{LayerNorm}_1\big(x + \mathrm{Dropout}(\mathrm{SelfAttn}(x, x, x))\big)$$

Cross-Attention on Global Token (TimeXer variant):

$$x_{\text{glb}}' = \mathrm{LayerNorm}_2\big(x_{\text{glb}} + \mathrm{Dropout}(\mathrm{CrossAttn}(x_{\text{glb}}, c, c))\big)$$

Where $x_{\text{glb}} = x'_{[:,\,-1,\,:]}$ is the global token extracted from the last position of the self-attention output and $c$ is the exogenous cross input. The updated global token is then concatenated back with the patch tokens: $\hat{x} = \mathrm{Concat}\big(x'_{[:,\,:-1,\,:]},\, x_{\text{glb}}'\big)$.

Position-Wise Feed-Forward Network:

$$\mathrm{FFN}(\hat{x}) = W_2\,\sigma(W_1 \hat{x} + b_1) + b_2$$

Where $\sigma$ is the activation function (ReLU or GELU) and $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$.

Full Layer Output:

$$y = \mathrm{LayerNorm}_3\big(\hat{x} + \mathrm{Dropout}(\mathrm{FFN}(\hat{x}))\big)$$
Pseudo-code for one encoder layer:
```python
# EncoderLayer forward pass (pseudo-code)
def encoder_layer(x, cross):
    # 1. Self-attention over all tokens, residual connection + layer norm
    x = layer_norm_1(x + dropout(self_attention(x, x, x)))

    # 2. Cross-attention on the global token (last position, kept as a length-1 sequence)
    x_glb = x[:, -1:, :]                        # shape (B, 1, d_model)
    x_glb = layer_norm_2(x_glb + dropout(cross_attention(x_glb, cross, cross)))

    # 3. Reassemble the token sequence and apply the position-wise feed-forward network
    x = concat([x[:, :-1, :], x_glb], dim=1)    # shape (B, N, d_model)
    y = dropout(activation(conv1d_up(x)))       # kernel-size-1 conv over the feature dimension
    y = dropout(conv1d_down(y))
    return layer_norm_3(x + y)
```
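For reference, a runnable PyTorch sketch of one encoder layer follows. It uses torch.nn.MultiheadAttention as a stand-in for the library's attention wrappers, so module names, signatures, and defaults here are assumptions rather than the actual sktime / pytorch-forecasting implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderLayer(nn.Module):
    """Illustrative sketch of one encoder block (not the library's implementation)."""

    def __init__(self, d_model, n_heads, d_ff=None, dropout=0.1, activation="relu"):
        super().__init__()
        d_ff = d_ff or 4 * d_model   # standard default: feed-forward dim = 4 x d_model
        self.self_attention = nn.MultiheadAttention(d_model, n_heads,
                                                    dropout=dropout, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads,
                                                     dropout=dropout, batch_first=True)
        # position-wise feed-forward network as two kernel-size-1 convolutions
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross):
        # 1. self-attention over all tokens, residual connection + layer norm
        attn_out, _ = self.self_attention(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))

        # 2. only the global token (last position) cross-attends to the exogenous tokens
        x_glb = x[:, -1:, :]                                   # (B, 1, d_model)
        glb_out, _ = self.cross_attention(x_glb, cross, cross, need_weights=False)
        x_glb = self.norm2(x_glb + self.dropout(glb_out))
        x = torch.cat([x[:, :-1, :], x_glb], dim=1)            # reassemble the token sequence

        # 3. position-wise feed-forward (Conv1d expects channels first), residual + norm
        y = self.dropout(self.activation(self.conv1(x.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        return self.norm3(x + y)
```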