Implementation: Sktime Pytorch forecasting EncoderLayer
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
EncoderLayer is a single encoder block that combines self-attention, cross-attention with a global token mechanism, and a position-wise feedforward network for the TimeXer model.
Description
The EncoderLayer class implements one layer of the TimeXer encoder. It first applies self-attention over the full input (including a global token), then extracts the global token and applies cross-attention between the global token and the cross (exogenous) input. The cross-attended global token is merged back into the sequence, and the result is passed through a two-layer 1D convolutional feedforward network with configurable activation (ReLU or GELU). Layer normalization and residual connections are applied after each sub-block.
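The flow described above can be sketched in plain PyTorch. This is a minimal illustration, not the library's implementation: it substitutes `nn.MultiheadAttention` for the library's pluggable attention modules, and the class name `EncoderLayerSketch` plus the exact residual/normalization ordering are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderLayerSketch(nn.Module):
    """Sketch of a TimeXer-style encoder layer (stand-in attention modules)."""

    def __init__(self, d_model, n_heads, d_ff=None, dropout=0.1, activation="relu"):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # mirrors the documented default
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross):
        B = cross.shape[0]  # true batch size; x is (B * n_vars, patches + 1, d_model)
        # 1) self-attention over patches + global token, residual + norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # 2) extract the global token (last position) and group per batch
        x_glb = x[:, -1, :].reshape(B, -1, x.shape[-1])  # (B, n_vars, d_model)
        # 3) cross-attend global tokens to the exogenous input
        glb_out, _ = self.cross_attn(x_glb, cross, cross)
        x_glb = x_glb + self.dropout(glb_out)
        x_glb = x_glb.reshape(-1, 1, x.shape[-1])  # back to (B * n_vars, 1, d_model)
        # 4) merge the global token back, then conv feedforward with residual
        y = x = self.norm2(torch.cat([x[:, :-1, :], x_glb], dim=1))
        y = self.dropout(self.activation(self.conv1(y.transpose(1, 2))))
        y = self.dropout(self.conv2(y).transpose(1, 2))
        return self.norm3(x + y)


layer = EncoderLayerSketch(d_model=64, n_heads=8)
out = layer(torch.randn(32, 7, 64), torch.randn(4, 96, 64))
print(out.shape)  # torch.Size([32, 7, 64])
```

Note how the output keeps the shape of `x`: only the representations change, which is what allows layers to be stacked.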
Usage
Use EncoderLayer as a building block within the Encoder module for constructing multi-layer TimeXer-style encoders. Each layer refines the patch-level representations through self-attention and enriches the global token via cross-attention with exogenous features before applying a feedforward transformation.
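The stacking pattern can be sketched as follows; the `EncoderSketch` class, its `norm_layer` argument, and the stand-in `_IdentityLayer` are illustrative assumptions, not the library's actual Encoder signature.

```python
import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    """Hypothetical multi-layer wrapper around EncoderLayer-style blocks."""

    def __init__(self, layers, norm_layer=None):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.norm = norm_layer

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        # each layer refines patch tokens and the global token in turn
        for layer in self.layers:
            x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)
        return self.norm(x) if self.norm is not None else x


class _IdentityLayer(nn.Module):
    """Stand-in layer with the expected forward signature, for illustration."""

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        return x


enc = EncoderSketch([_IdentityLayer() for _ in range(2)], norm_layer=nn.LayerNorm(64))
out = enc(torch.randn(32, 7, 64), torch.randn(4, 96, 64))
print(out.shape)  # torch.Size([32, 7, 64])
```

Because every layer preserves the input shape, depth is a free hyperparameter here.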
Code Reference
Source Location
- Repository: Sktime_Pytorch_forecasting
- File: pytorch_forecasting/layers/_encoders/_encoder_layer.py
- Lines: 1-73
Signature
```python
class EncoderLayer(nn.Module):
    def __init__(
        self,
        self_attention,
        cross_attention,
        d_model,
        d_ff=None,
        dropout=0.1,
        activation="relu",
    ):
        ...

    def forward(self, x, cross, x_mask=None, cross_mask=None, tau=None, delta=None):
        ...
```
Import
```python
from pytorch_forecasting.layers import EncoderLayer
```
I/O Contract
Inputs
__init__ Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| self_attention | nn.Module | Yes | Self-attention mechanism (e.g., AttentionLayer wrapping FullAttention). |
| cross_attention | nn.Module | Yes | Cross-attention mechanism for attending to exogenous features. |
| d_model | int | Yes | Dimension of the model embedding space. |
| d_ff | int | No | Dimension of the feedforward layer. Defaults to 4 * d_model if not specified. |
| dropout | float | No | Dropout rate. Defaults to 0.1. |
| activation | str | No | Activation function for the feedforward network: "relu" or "gelu". Defaults to "relu". |
forward Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Input tensor of shape (batch_size * n_vars, num_patches + 1, d_model), where the last position is the global token. |
| cross | torch.Tensor | Yes | Cross-attention input (exogenous features) of shape (batch_size, cross_len, d_model). |
| x_mask | torch.Tensor | No | Optional attention mask for self-attention. Defaults to None. |
| cross_mask | torch.Tensor | No | Optional attention mask for cross-attention. Defaults to None. |
| tau | float | No | Optional temperature parameter for attention scaling. Defaults to None. |
| delta | torch.Tensor | No | Optional positional delta for cross-attention. Defaults to None. |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Encoded output tensor with the same shape as the input x, carrying representations updated by self-attention, cross-attention, and the feedforward network. |
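The shape relationships in this contract can be checked with plain tensor operations (a sketch with example sizes; no library dependency):

```python
import torch

batch, n_vars, num_patches, d_model, cross_len = 4, 8, 6, 64, 96

x = torch.randn(batch * n_vars, num_patches + 1, d_model)  # patches + global token
cross = torch.randn(batch, cross_len, d_model)             # exogenous features

# The global token occupies the last position of each row; cross-attention
# operates on it after reshaping back to (batch, n_vars, d_model).
x_glb = x[:, -1, :].reshape(batch, n_vars, d_model)
print(x.shape, cross.shape, x_glb.shape)
# torch.Size([32, 7, 64]) torch.Size([4, 96, 64]) torch.Size([4, 8, 64])
```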
Usage Examples
```python
import torch
from pytorch_forecasting.layers import EncoderLayer, AttentionLayer, FullAttention

d_model = 64
n_heads = 8

# Create a single encoder layer
layer = EncoderLayer(
    self_attention=AttentionLayer(FullAttention(), d_model, n_heads),
    cross_attention=AttentionLayer(FullAttention(), d_model, n_heads),
    d_model=d_model,
    d_ff=256,
    dropout=0.1,
    activation="relu",
)

# Self-attention input: (batch * n_vars, num_patches + 1, d_model)
x = torch.randn(32, 7, d_model)      # e.g., 4 batches * 8 vars, 6 patches + 1 global token
cross = torch.randn(4, 96, d_model)  # exogenous features: (batch, cross_len, d_model)

output = layer(x, cross)
print(output.shape)  # torch.Size([32, 7, 64])
```