Principle:Sktime Pytorch forecasting TimeXer Architecture

Knowledge Sources	TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables; pytorch-forecasting
Domains	Time_Series, Forecasting, Deep_Learning, Attention_Mechanisms, Transformer_Models
Last Updated	2026-02-08 09:00 GMT

Overview

TimeXer is a transformer-based architecture that reconciles endogenous (target) and exogenous (covariate) information for time series forecasting through a dual-representation scheme: patch-level tokens for the endogenous series and variate-level tokens for exogenous variables, bridged by a learnable global token and cross-attention.

Description

TimeXer (Time Series Transformer with eXogenous Variables) empowers the canonical transformer to jointly model intra-endogenous temporal patterns and exogenous-to-endogenous correlations without custom architectural modifications. The key design insight is that endogenous and exogenous variables call for different granularities of representation: patch-level for the target series and variate-level for the covariates.

Endogenous embedding: The target (endogenous) time series is divided into non-overlapping patches of length P. Each patch is linearly projected into a d-dimensional token, to which a sinusoidal positional encoding is added. A learnable global token is appended to the endogenous token sequence. This global token serves as the information bridge between the endogenous and exogenous streams.

Exogenous embedding: Each exogenous covariate is treated at the variate level. The full context-length vector of each exogenous variable is linearly projected into a single d-dimensional token (inverted embedding), yielding one token per covariate rather than one token per time step.

Encoder with dual attention: Each encoder layer contains two attention sub-layers:

Self-attention over endogenous patch tokens (including the global token), capturing temporal dependencies within the target series.
Cross-attention where the global token attends to exogenous variate tokens, absorbing covariate information. The updated global token is then re-injected into the endogenous token sequence.

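Each attention sub-layer is wrapped in a residual connection followed by layer normalization. A feed-forward network (1-D convolutions with ReLU or GELU activation) is then applied to the endogenous tokens, again with a residual connection and layer normalization.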

Flatten head: After encoding, the sequence of patch tokens plus the global token is flattened and linearly projected to produce the prediction of length H for each target variable. The head supports quantile output when paired with QuantileLoss.

The architecture supports univariate (S), multivariate-single-target (MS), and multivariate-multi-target (M) forecasting modes. Two API versions exist: a v1 implementation based on BaseModelWithCovariates and a v2 implementation built on TslibBaseModel for the newer data pipeline.

Usage

Use TimeXer when forecasting a target time series in the presence of exogenous covariates (e.g., weather, holidays, pricing signals). It is especially suitable when: (1) exogenous variables carry information at a different temporal granularity than the endogenous series, (2) the forecasting horizon is medium- to long-term, and (3) a transformer-based approach with attention-driven interpretability is desired. The patch length P should divide the context length so that no trailing time steps are left out of the patches; a typical choice is 4 to 24 depending on series frequency.
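The divisibility guideline can be checked mechanically. The helper below is a minimal sketch, not part of pytorch-forecasting: the function name and the default 4 to 24 search range are assumptions taken from the typical range quoted above.

```python
from typing import List


def candidate_patch_lengths(context_length: int, low: int = 4, high: int = 24) -> List[int]:
    """Return every patch length in [low, high] that divides context_length exactly.

    Hypothetical helper for illustration; not a library function.
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return [p for p in range(low, high + 1) if context_length % p == 0]


# A 96-step context admits six candidate patch lengths in the 4-24 range.
print(candidate_patch_lengths(96))  # [4, 6, 8, 12, 16, 24]
```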

Theoretical Basis

Patch tokenization of endogenous series:

Given an endogenous series $x^{\mathrm{en}} \in \mathbb{R}^{T}$ and patch length $P$, the number of patches is $N = \lfloor T / P \rfloor$. Each patch $p_{j} \in \mathbb{R}^{P}$ is projected:

$z_{j}^{\mathrm{en}} = W_{\mathrm{en}} p_{j} + \mathrm{PE}(j), \quad W_{\mathrm{en}} \in \mathbb{R}^{d \times P}$

where $\mathrm{PE}(j)$ is the sinusoidal positional encoding.
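The patch projection can be illustrated with a small NumPy sketch. It is not the library's PyTorch code: the function names are hypothetical, the projection has no bias term, the weights are random rather than learned, and the positional encoding is assumed to be the standard sine/cosine form with an even embedding dimension.

```python
import numpy as np


def sinusoidal_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Standard sine/cosine positional encoding, shape (n_positions, d_model)."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding dimension"
    positions = np.arange(n_positions)[:, None]
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(positions * div_term)  # even channels
    pe[:, 1::2] = np.cos(positions * div_term)  # odd channels
    return pe


def patch_embed(x_en: np.ndarray, patch_length: int, W_en: np.ndarray) -> np.ndarray:
    """Split a length-T series into N = floor(T / P) patches and project each to d dims.

    W_en has shape (d, P); the result has shape (N, d).
    """
    n_patches = len(x_en) // patch_length
    # Trailing steps that do not fill a whole patch are dropped.
    patches = x_en[: n_patches * patch_length].reshape(n_patches, patch_length)
    return patches @ W_en.T + sinusoidal_encoding(n_patches, W_en.shape[0])


rng = np.random.default_rng(0)
x_en = rng.normal(size=96)        # context length T = 96
W_en = rng.normal(size=(16, 8))   # d = 16, P = 8
tokens = patch_embed(x_en, patch_length=8, W_en=W_en)
print(tokens.shape)  # (12, 16)
```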

Global token:

A learnable parameter $z_{\mathrm{glb}} \in \mathbb{R}^{d}$ is appended, forming the endogenous token sequence $Z^{\mathrm{en}} = [z_{1}^{\mathrm{en}}, \dots, z_{N}^{\mathrm{en}}, z_{\mathrm{glb}}]$.
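In array terms, appending the global token adds one row to the patch-token matrix. The NumPy sketch below uses a fixed random vector as a stand-in for the learnable parameter; the function name is hypothetical.

```python
import numpy as np


def append_global_token(patch_tokens: np.ndarray, z_glb: np.ndarray) -> np.ndarray:
    """Append the global token as the last row: (N, d) -> (N + 1, d)."""
    return np.concatenate([patch_tokens, z_glb[None, :]], axis=0)


rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(12, 16))  # N = 12 patch tokens, d = 16
z_glb = rng.normal(size=16)               # stand-in for the learnable global token
Z_en = append_global_token(patch_tokens, z_glb)
print(Z_en.shape)  # (13, 16)
```

Keeping the global token in the last position makes it easy to slice out (`Z_en[-1]`) when it later queries the exogenous tokens.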

Inverted exogenous embedding:

For $M$ exogenous variables, each full-length variate vector $x_{m}^{\mathrm{ex}} \in \mathbb{R}^{T}$ is projected:

$z_{m}^{\mathrm{ex}} = W_{\mathrm{ex}} x_{m}^{\mathrm{ex}}, \quad W_{\mathrm{ex}} \in \mathbb{R}^{d \times T}$
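The inverted embedding amounts to a single matrix product over the time axis. The NumPy sketch below assumes the covariates are stored as a (T, M) array and uses random weights in place of learned ones; the function name is hypothetical.

```python
import numpy as np


def inverted_embed(X_ex: np.ndarray, W_ex: np.ndarray) -> np.ndarray:
    """Project each covariate's full history to one token.

    X_ex: (T, M) covariate matrix, W_ex: (d, T) projection. Returns (M, d).
    """
    return X_ex.T @ W_ex.T


rng = np.random.default_rng(0)
X_ex = rng.normal(size=(96, 3))   # T = 96 time steps, M = 3 covariates
W_ex = rng.normal(size=(16, 96))  # d = 16
Z_ex = inverted_embed(X_ex, W_ex)
print(Z_ex.shape)  # (3, 16): one token per covariate, not per time step
```

The token count depends only on the number of covariates, so a longer context widens the projection matrix but does not lengthen the exogenous sequence that the global token attends to.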

Dual-attention encoder layer:

# Self-attention on endogenous tokens (patches + global)
Z_en = LayerNorm(Z_en + SelfAttention(Z_en, Z_en, Z_en))

# Cross-attention: the global token (last endogenous token) queries exogenous tokens
z_glb = Z_en[-1]
z_glb = LayerNorm(z_glb + CrossAttention(Q=z_glb, K=Z_ex, V=Z_ex))

# Replace global token and apply feed-forward
Z_en[-1] = z_glb
Z_en = LayerNorm(Z_en + FeedForward(Z_en))
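The pseudocode above can be made executable with a simplified NumPy sketch. It is deliberately reduced and is not the pytorch-forecasting implementation: attention is single-head without learned query/key/value projections, layer normalization has no learnable scale or shift, dropout is omitted, and the feed-forward block is written as two position-wise linear maps (equivalent to 1-D convolutions under the assumption of kernel size 1). All function names are hypothetical.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def encoder_layer(Z_en, Z_ex, W1, W2):
    """One dual-attention layer. Z_en: (N + 1, d) with the global token last; Z_ex: (M, d)."""
    # Self-attention over patch tokens plus the global token.
    Z_en = layer_norm(Z_en + attention(Z_en, Z_en, Z_en))
    # Cross-attention: only the global token queries the exogenous tokens.
    z_glb = Z_en[-1:]
    z_glb = layer_norm(z_glb + attention(z_glb, Z_ex, Z_ex))
    # Re-inject the updated global token, then apply the position-wise feed-forward.
    Z_en = np.concatenate([Z_en[:-1], z_glb], axis=0)
    ff = np.maximum(Z_en @ W1, 0.0) @ W2  # ReLU between two linear maps
    return layer_norm(Z_en + ff)


rng = np.random.default_rng(0)
d, d_ff = 16, 32
Z_en = rng.normal(size=(13, d))
Z_ex = rng.normal(size=(3, d))
W1 = rng.normal(size=(d, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d)) * 0.1
out = encoder_layer(Z_en, Z_ex, W1, W2)
print(out.shape)  # (13, 16)
```

In this sketch, a single layer lets exogenous information reach only the global token; the patch tokens pick it up through self-attention in the following layer, which is one reason to stack at least two encoder layers.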

Output projection:

$\hat{y} = W_{\mathrm{out}} \, \mathrm{Flatten}(Z^{\mathrm{en}}) \in \mathbb{R}^{H \times Q}$

where $H$ is the prediction length and $Q$ is the number of quantiles (1 for point forecasts).
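As a shape check, the output projection can be sketched in NumPy for a single target series. The weight layout (one row per horizon step and quantile pair), the absence of a bias, and the function name are assumptions made for illustration, not the library's layout.

```python
import numpy as np


def flatten_head(Z_en: np.ndarray, W_out: np.ndarray, horizon: int, n_quantiles: int) -> np.ndarray:
    """Flatten all endogenous tokens and project them to (horizon, n_quantiles).

    Z_en: (N + 1, d); W_out: (horizon * n_quantiles, (N + 1) * d).
    """
    flat = Z_en.reshape(-1)
    return (W_out @ flat).reshape(horizon, n_quantiles)


rng = np.random.default_rng(0)
n_tokens, d, H, Q = 13, 16, 24, 3   # 12 patches + global token, 3 quantiles
Z_en = rng.normal(size=(n_tokens, d))
W_out = rng.normal(size=(H * Q, n_tokens * d))
y_hat = flatten_head(Z_en, W_out, horizon=H, n_quantiles=Q)
print(y_hat.shape)  # (24, 3)
```

The head's weight count grows as H x Q x (N + 1) x d, so a shorter patch length (more tokens) or a longer horizon directly enlarges this final layer.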

Key hyperparameters:

hidden_size (d) -- embedding dimension (typical: 256-512)
n_heads -- attention heads (typical: 4-8, must divide hidden_size)
e_layers -- number of encoder layers (typical: 2-6)
patch_length (P) -- patch size for endogenous tokenization (typical: 4-24)
d_ff -- feed-forward hidden dimension (typical: 1024-2048)
dropout -- regularization (typical: 0.1-0.2)
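The constraints attached to these hyperparameters can be collected in a small configuration object. The dataclass below is a hypothetical sketch rather than a pytorch-forecasting class: the class name, the `context_length` and `prediction_length` fields, and the default values are assumptions chosen from the typical ranges listed above.

```python
from dataclasses import dataclass


@dataclass
class TimeXerConfig:
    """Hypothetical container for the hyperparameters listed above."""

    context_length: int
    prediction_length: int
    hidden_size: int = 256
    n_heads: int = 8
    e_layers: int = 2
    patch_length: int = 16
    d_ff: int = 1024
    dropout: float = 0.1

    def __post_init__(self) -> None:
        if self.hidden_size % self.n_heads != 0:
            raise ValueError("n_heads must divide hidden_size")
        if self.context_length % self.patch_length != 0:
            raise ValueError("patch_length should divide context_length")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def n_tokens(self) -> int:
        """Number of endogenous tokens: patches plus the global token."""
        return self.context_length // self.patch_length + 1

    @property
    def head_in_features(self) -> int:
        """Input width of the flatten head for one target series."""
        return self.hidden_size * self.n_tokens


cfg = TimeXerConfig(context_length=96, prediction_length=24)
print(cfg.n_tokens, cfg.head_in_features)  # 7 1792
```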

Related Pages

Implemented By
