Principle:Sktime Pytorch forecasting TimeXer Architecture

From Leeroopedia


Knowledge Sources
Domains Time_Series, Forecasting, Deep_Learning, Attention_Mechanisms, Transformer_Models
Last Updated 2026-02-08 09:00 GMT

Overview

TimeXer is a transformer-based architecture that reconciles endogenous (target) and exogenous (covariate) information for time series forecasting through a dual-representation scheme: patch-level tokens for the endogenous series and variate-level tokens for exogenous variables, bridged by a learnable global token and cross-attention.

Description

TimeXer (Time Series Transformer with eXogenous Variables) enables the canonical transformer to jointly model intra-endogenous temporal patterns and exogenous-to-endogenous correlations without custom architectural modifications. The key design insight is that endogenous and exogenous variables call for different granularities of representation: fine-grained patches for the target series and one coarse token per covariate.

Endogenous embedding: The target (endogenous) time series is divided into non-overlapping patches of length P. Each patch is linearly projected into a d-dimensional token and enriched with a sinusoidal positional encoding. A learnable global token is then appended to the endogenous token sequence. This global token serves as the information bridge between the endogenous and exogenous streams.

Exogenous embedding: Each exogenous covariate is treated at the variate level. The full context-length vector of each exogenous variable is linearly projected into a single d-dimensional token (inverted embedding), yielding one token per covariate rather than one token per time step.

Encoder with dual attention: Each encoder layer contains two attention sub-layers:

  1. Self-attention over endogenous patch tokens (including the global token), capturing temporal dependencies within the target series.
  2. Cross-attention where the global token attends to exogenous variate tokens, absorbing covariate information. The updated global token is then re-injected into the endogenous token sequence.

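After the attention sub-layers and layer normalization, a feed-forward network (1-D convolutions with ReLU or GELU activation) is applied to the endogenous token sequence, followed by a residual connection and a final layer normalization.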

Flatten head: After encoding, the sequence of patch tokens plus the global token is flattened and linearly projected to produce the prediction of length H for each target variable. The head supports quantile output when paired with QuantileLoss.

The architecture supports univariate (S), multivariate-single-target (MS), and multivariate-multi-target (M) forecasting modes. Two API versions exist: a v1 implementation based on BaseModelWithCovariates and a v2 implementation on TslibBaseModel for the newer data pipeline.

Usage

Use TimeXer when forecasting a target time series in the presence of exogenous covariates (e.g., weather, holidays, pricing signals). It is especially suitable when: (1) exogenous variables carry information at a different temporal granularity than the endogenous series, (2) the forecasting horizon is medium- to long-term, and (3) a transformer-based approach with attention-driven interpretability is desired. The patch length P should evenly divide the context length; a typical choice is 4 to 24, depending on series frequency.

Theoretical Basis

Patch tokenization of endogenous series:

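Given an endogenous series $x_{en} \in \mathbb{R}^{T}$ and patch length $P$, the number of patches is $N = T/P$. Each patch $p_j \in \mathbb{R}^{P}$ is projected:

$$z_j^{en} = W_{en}\, p_j + \mathrm{PE}(j), \qquad W_{en} \in \mathbb{R}^{d \times P}$$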

where PE(j) is the sinusoidal positional encoding.
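The patch projection can be illustrated with a small NumPy sketch. It assumes the standard sinusoidal encoding from the original transformer, an even embedding dimension d, and a context length that is a multiple of P; the function names `sinusoidal_encoding` and `patch_embed` are hypothetical and do not correspond to the library's API.

```python
import numpy as np


def sinusoidal_encoding(num_positions: int, d: int) -> np.ndarray:
    """Standard sinusoidal positional encoding, shape (num_positions, d). Assumes even d."""
    positions = np.arange(num_positions)[:, None]                # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)    # (d/2,)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(positions * freqs)                      # even feature indices
    pe[:, 1::2] = np.cos(positions * freqs)                      # odd feature indices
    return pe


def patch_embed(x_en: np.ndarray, W_en: np.ndarray, P: int) -> np.ndarray:
    """Split x_en (T,) into N = T/P non-overlapping patches and project each to d dims."""
    T = x_en.shape[0]
    if T % P != 0:
        raise ValueError("patch length P must divide the context length T")
    patches = x_en.reshape(T // P, P)        # (N, P), row j is patch p_j
    tokens = patches @ W_en.T                # (N, d), row j is W_en p_j
    return tokens + sinusoidal_encoding(T // P, W_en.shape[0])


rng = np.random.default_rng(0)
T, P, d = 96, 16, 8
x_en = rng.normal(size=T)
W_en = rng.normal(size=(d, P)) / np.sqrt(P)
print(patch_embed(x_en, W_en, P).shape)  # (6, 8)
```

A trained model would learn `W_en`; the random matrix here only serves to show the shapes.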

Global token:

A learnable parameter $z_{glb} \in \mathbb{R}^{d}$ is appended, forming the endogenous token sequence $Z_{en} = [z_1^{en}, \dots, z_N^{en}, z_{glb}]$.

Inverted exogenous embedding:

For $M$ exogenous variables, each full-length variate vector $x_m^{ex} \in \mathbb{R}^{T}$ is projected:

$$z_m^{ex} = W_{ex}\, x_m^{ex}, \qquad W_{ex} \in \mathbb{R}^{d \times T}$$
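As a shape-level illustration of the inverted embedding, the sketch below maps M covariate series of length T to M tokens of dimension d with a single shared projection. The function name `inverted_embed` and the random weights are assumptions for illustration; stacking the resulting tokens row-wise gives the exogenous token matrix that the encoder pseudocode refers to as Z_ex.

```python
import numpy as np


def inverted_embed(X_ex: np.ndarray, W_ex: np.ndarray) -> np.ndarray:
    """Project each full-length covariate to one token.

    X_ex: (M, T), one row per exogenous variable.
    W_ex: (d, T), shared across variables.
    Returns Z_ex with shape (M, d); row m equals W_ex @ X_ex[m].
    """
    return X_ex @ W_ex.T


rng = np.random.default_rng(0)
T, M, d = 96, 3, 8
X_ex = rng.normal(size=(M, T))
W_ex = rng.normal(size=(d, T)) / np.sqrt(T)
Z_ex = inverted_embed(X_ex, W_ex)
print(Z_ex.shape)  # (3, 8)
```

The token count depends only on the number of covariates, not on the context length, which keeps the cross-attention cost small even for long contexts.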

Dual-attention encoder layer:

# Self-attention on endogenous tokens (patches + global)
Z_en = LayerNorm(Z_en + SelfAttention(Z_en, Z_en, Z_en))

# Cross-attention: global token queries exogenous tokens
z_glb = LayerNorm(z_glb + CrossAttention(Q=z_glb, K=Z_ex, V=Z_ex))

# Replace global token and apply feed-forward
Z_en[-1] = z_glb
Z_en = LayerNorm(Z_en + FeedForward(Z_en))
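The pseudocode above can be turned into a runnable NumPy sketch under several simplifying assumptions: a single attention head, no learned query/key/value projections, layer normalization without learnable scale and shift, no dropout, and a position-wise two-layer ReLU network standing in for the 1-D convolutional feed-forward block. All function names are hypothetical; this is not the library's implementation.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token over its feature dimension (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V


def encoder_layer(Z_en, Z_ex, W1, W2):
    """Z_en: (N+1, d) with the global token last; Z_ex: (M, d); W1: (d, d_ff); W2: (d_ff, d)."""
    # Self-attention over patch tokens and the global token.
    Z_en = layer_norm(Z_en + attention(Z_en, Z_en, Z_en))
    # Cross-attention: only the global token queries the exogenous tokens.
    z_glb = Z_en[-1:]                                           # (1, d)
    z_glb = layer_norm(z_glb + attention(z_glb, Z_ex, Z_ex))
    # Re-inject the updated global token, then apply the position-wise feed-forward.
    Z_en = np.concatenate([Z_en[:-1], z_glb], axis=0)
    ff = np.maximum(Z_en @ W1, 0.0) @ W2
    return layer_norm(Z_en + ff)


rng = np.random.default_rng(0)
N, M, d, d_ff = 6, 3, 8, 16
Z_en = rng.normal(size=(N + 1, d))
Z_ex = rng.normal(size=(M, d))
W1 = rng.normal(size=(d, d_ff)) / np.sqrt(d)
W2 = rng.normal(size=(d_ff, d)) / np.sqrt(d_ff)
out = encoder_layer(Z_en, Z_ex, W1, W2)
print(out.shape)  # (7, 8)
```

One consequence of this layout is visible in the sketch: within a single layer, the exogenous tokens change only the global token. The patch tokens pick up covariate information in the next layer, when self-attention lets them attend to the updated global token, which suggests why stacking at least two encoder layers matters.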

Output projection:

$$\hat{y} = W_{out}\, \mathrm{Flatten}(Z_{en}) \in \mathbb{R}^{H \times Q}$$

where H is the prediction length and Q is the number of quantiles (1 for point forecasts).
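A shape-level sketch of the flatten head for a single target variable follows. It assumes that the output weight has shape (H·Q, (N+1)·d) and that the projected vector is reshaped to (H, Q); the actual memory layout and any bias or dropout in the library's head may differ, and the name `flatten_head` is hypothetical.

```python
import numpy as np


def flatten_head(Z_en: np.ndarray, W_out: np.ndarray, H: int, Q: int) -> np.ndarray:
    """Flatten all N+1 encoded tokens and project them to H steps x Q quantiles.

    Z_en: (N+1, d); W_out: (H*Q, (N+1)*d), an assumed layout.
    """
    flat = Z_en.reshape(-1)                 # ((N+1)*d,)
    return (W_out @ flat).reshape(H, Q)     # (H, Q)


rng = np.random.default_rng(0)
N, d, H, Q = 6, 8, 24, 3
Z_en = rng.normal(size=(N + 1, d))
W_out = rng.normal(size=(H * Q, (N + 1) * d))
print(flatten_head(Z_en, W_out, H, Q).shape)  # (24, 3)
```

With Q = 1 the same head yields a point forecast of shape (H, 1).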

Key hyperparameters:

  • hidden_size (d) -- embedding dimension (typical: 256-512)
  • n_heads -- attention heads (typical: 4-8, must divide hidden_size)
  • e_layers -- number of encoder layers (typical: 2-6)
  • patch_length (P) -- patch size for endogenous tokenization (typical: 4-24)
  • d_ff -- feed-forward hidden dimension (typical: 1024-2048)
  • dropout -- regularization (typical: 0.1-0.2)
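The constraints among these hyperparameters, and the tensor sizes they imply, can be checked before building a model. The dataclass below is a hypothetical helper written for this page, not a class from the library; it only encodes the two divisibility rules stated above and derives the per-head dimension, the endogenous token count (N patches plus the global token), and the flatten-head input size.

```python
from dataclasses import dataclass


@dataclass
class TimeXerShape:
    """Hypothetical shape checker for a TimeXer-style configuration."""

    context_length: int
    hidden_size: int
    n_heads: int
    patch_length: int

    def __post_init__(self) -> None:
        if self.hidden_size % self.n_heads != 0:
            raise ValueError("n_heads must divide hidden_size")
        if self.context_length % self.patch_length != 0:
            raise ValueError("patch_length must divide context_length")

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.n_heads

    @property
    def n_tokens(self) -> int:
        # N patch tokens plus one global token.
        return self.context_length // self.patch_length + 1

    @property
    def flatten_dim(self) -> int:
        # Input width of the flatten head: (N + 1) * d.
        return self.hidden_size * self.n_tokens


shape = TimeXerShape(context_length=96, hidden_size=256, n_heads=8, patch_length=16)
print(shape.head_dim, shape.n_tokens, shape.flatten_dim)  # 32 7 1792
```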

Related Pages

Implemented By
