Implementation: Sktime Pytorch forecasting EncoderLayer
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
EncoderLayer is a single encoder block that combines self-attention, cross-attention with a global token mechanism, and a position-wise feedforward network for the TimeXer model.
Description
The EncoderLayer class implements one layer of the TimeXer encoder. It first applies self-attention over the full input (including a global token), then extracts the global token and applies cross-attention between the global token and the cross (exogenous) input. The cross-attended global token is merged back into the sequence, and the result is passed through a two-layer 1D convolutional feedforward network with configurable activation (ReLU or GELU). Layer normalization and residual connections are applied after each sub-block.
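The flow described above can be sketched in plain PyTorch. This is a minimal illustration, not the library's implementation: it substitutes `nn.MultiheadAttention` for the library's pluggable attention modules, and the class name `EncoderLayerSketch` plus the exact residual/normalization ordering are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderLayerSketch(nn.Module):
    """Sketch of a TimeXer-style encoder layer (stand-in attention modules)."""

    def __init__(self, d_model, n_heads, d_ff=None, dropout=0.1, activation="relu"):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # mirrors the documented default
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross):
        B = cross.shape[0]  # true batch size; x is (B * n_vars, patches + 1, d_model)
        # 1) self-attention over patches + global token, residual + norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # 2) extract the global token (last position) and group per batch
        x_glb = x[:, -1, :].reshape(B, -1, x.shape[-1])  # (B, n_vars, d_model)
        # 3) cross-attend global tokens to the exogenous input
        glb_out, _ = self.cross_attn(x_glb, cross, cross)
        x_glb = x_glb + self.dropout(glb_out)
        x_glb = x_glb.reshape(-1, 1, x.shape[-1])  # back to (B * n_vars, 1, d_model)
        # 4) merge the global token back, then conv feedforward with residual
        y = x = self.norm2(torch.cat([x[:, :-1, :], x_glb], dim=1))
        y = self.dropout(self.activation(self.conv1(y.transpose(1, 2))))
        y = self.dropout(self.conv2(y).transpose(1, 2))
        return self.norm3(x + y)


layer = EncoderLayerSketch(d_model=64, n_heads=8)
out = layer(torch.randn(32, 7, 64), torch.randn(4, 96, 64))
print(out.shape)  # torch.Size([32, 7, 64])
```

Note how the output keeps the shape of `x`: only the representations change, which is what allows layers to be stacked.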
Usage
Use EncoderLayer as a building block within the Encoder module for constructing multi-layer TimeXer-style encoders. Each layer refines the patch-level representations through self-attention and enriches the global token via cross-attention with exogenous features before applying a feedforward transformation.
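The stacking pattern can be sketched as follows; the `EncoderSketch` class, its `norm_layer` argument, and the stand-in `_IdentityLayer` are illustrative assumptions, not the library's actual Encoder signature.

```python
import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    """Hypothetical multi-layer wrapper around EncoderLayer-style blocks."""

    def __init__(self, layers, norm_layer=None):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.norm = norm_layer

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        # each layer refines patch tokens and the global token in turn
        for layer in self.layers:
            x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)
        return self.norm(x) if self.norm is not None else x


class _IdentityLayer(nn.Module):
    """Stand-in layer with the expected forward signature, for illustration."""

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        return x


enc = EncoderSketch([_IdentityLayer() for _ in range(2)], norm_layer=nn.LayerNorm(64))
out = enc(torch.randn(32, 7, 64), torch.randn(4, 96, 64))
print(out.shape)  # torch.Size([32, 7, 64])
```

Because every layer preserves the input shape, depth is a free hyperparameter here.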
Code Reference
Source Location
- Repository: Sktime_Pytorch_forecasting
- File: pytorch_forecasting/layers/_encoders/_encoder_layer.py
- Lines: 1-73
Signature
```python
class EncoderLayer(nn.Module):
    def __init__(
        self,
        self_attention,
        cross_attention,
        d_model,
        d_ff=None,
        dropout=0.1,
        activation="relu",
    ):
        ...

    def forward(self, x, cross, x_mask=None, cross_mask=None, tau=None, delta=None):
        ...
```
Import
```python
from pytorch_forecasting.layers import EncoderLayer
```
I/O Contract
Inputs
__init__ Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| self_attention | nn.Module | Yes | Self-attention mechanism (e.g., AttentionLayer wrapping FullAttention). |
| cross_attention | nn.Module | Yes | Cross-attention mechanism for attending to exogenous features. |
| d_model | int | Yes | Dimension of the model embedding space. |
| d_ff | int | No | Dimension of the feedforward layer. Defaults to 4 * d_model if not specified. |
| dropout | float | No | Dropout rate. Defaults to 0.1. |
| activation | str | No | Activation function for the feedforward network: "relu" or "gelu". Defaults to "relu". |
forward Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Input tensor of shape (batch_size * n_vars, num_patches + 1, d_model), where the last position is the global token. |
| cross | torch.Tensor | Yes | Cross-attention input (exogenous features) of shape (batch_size, cross_len, d_model). |
| x_mask | torch.Tensor | No | Optional attention mask for self-attention. Defaults to None. |
| cross_mask | torch.Tensor | No | Optional attention mask for cross-attention. Defaults to None. |
| tau | float | No | Optional temperature parameter for attention scaling. Defaults to None. |
| delta | torch.Tensor | No | Optional positional delta for cross-attention. Defaults to None. |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Encoded output tensor with the same shape as the input x, carrying representations updated by self-attention, cross-attention, and the feedforward network. |
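The shape relationships in this contract can be checked with plain tensor operations (a sketch with example sizes; no library dependency):

```python
import torch

batch, n_vars, num_patches, d_model, cross_len = 4, 8, 6, 64, 96

x = torch.randn(batch * n_vars, num_patches + 1, d_model)  # patches + global token
cross = torch.randn(batch, cross_len, d_model)             # exogenous features

# The global token occupies the last position of each row; cross-attention
# operates on it after reshaping back to (batch, n_vars, d_model).
x_glb = x[:, -1, :].reshape(batch, n_vars, d_model)
print(x.shape, cross.shape, x_glb.shape)
# torch.Size([32, 7, 64]) torch.Size([4, 96, 64]) torch.Size([4, 8, 64])
```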
Usage Examples
```python
import torch
from pytorch_forecasting.layers import EncoderLayer, AttentionLayer, FullAttention

d_model = 64
n_heads = 8

# Create a single encoder layer
layer = EncoderLayer(
    self_attention=AttentionLayer(FullAttention(), d_model, n_heads),
    cross_attention=AttentionLayer(FullAttention(), d_model, n_heads),
    d_model=d_model,
    d_ff=256,
    dropout=0.1,
    activation="relu",
)

# Self-attention input: (batch * n_vars, num_patches + 1, d_model)
x = torch.randn(32, 7, d_model)      # e.g., 4 batches * 8 vars, 6 patches + 1 global token
cross = torch.randn(4, 96, d_model)  # exogenous features: (batch, cross_len, d_model)

output = layer(x, cross)
print(output.shape)  # torch.Size([32, 7, 64])
```