Principle:Sktime Pytorch forecasting TimeXer Architecture
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning, Attention_Mechanisms, Transformer_Models |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
TimeXer is a transformer-based architecture that reconciles endogenous (target) and exogenous (covariate) information for time series forecasting through a dual-representation scheme: patch-level tokens for the endogenous series and variate-level tokens for exogenous variables, bridged by a learnable global token and cross-attention.
Description
TimeXer (Time Series Transformer with eXogenous Variables) lets the canonical transformer jointly model temporal patterns within the endogenous series and exogenous-to-endogenous correlations using only standard attention and feed-forward components. The key design insight is that endogenous and exogenous variables call for different representation granularities: patches for the target, whole-series tokens for the covariates.
Endogenous embedding: The target (endogenous) time series is divided into non-overlapping patches of length P. Each patch is linearly projected into a d-dimensional token and enriched with a sinusoidal positional encoding. A learnable global token is then appended to the endogenous token sequence, where it serves as the information bridge between the endogenous and exogenous streams.
Exogenous embedding: Each exogenous covariate is treated at the variate level. The full context-length vector of each exogenous variable is linearly projected into a single d-dimensional token (inverted embedding), yielding one token per covariate rather than one token per time step.
Encoder with dual attention: Each encoder layer contains two attention sub-layers:
- Self-attention over endogenous patch tokens (including the global token), capturing temporal dependencies within the target series.
- Cross-attention where the global token attends to exogenous variate tokens, absorbing covariate information. The updated global token is then re-injected into the endogenous token sequence.
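Each attention sub-layer is wrapped in a residual connection followed by layer normalization. A feed-forward network (1-D convolutions with ReLU or GELU activation) is then applied to all endogenous tokens, again with a residual connection and layer normalization.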
Flatten head: After encoding, the sequence of patch tokens plus the global token is flattened and linearly projected to produce the prediction of length H for each target variable. The head supports quantile output when paired with QuantileLoss.
The architecture supports univariate (S), multivariate-single-target (MS), and multivariate-multi-target (M) forecasting modes. Two API versions exist: a v1 implementation based on BaseModelWithCovariates and a v2 implementation on TslibBaseModel for the newer data pipeline.
Usage
Use TimeXer when forecasting a target time series in the presence of exogenous covariates (e.g., weather, holidays, pricing signals). It is especially suitable when: (1) exogenous variables carry information at a different temporal granularity than the endogenous series, (2) the forecasting horizon is medium to long-term, and (3) a transformer-based approach with attention-driven interpretability is desired. The patch length P should divide the context length; a typical choice is 4 to 24 depending on series frequency.
Theoretical Basis
Patch tokenization of endogenous series:
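Given an endogenous series $x \in \mathbb{R}^{T}$ and patch length $P$, the number of patches is $N = \lfloor T / P \rfloor$. Each patch is projected:
$$s_i = W_p \, x_{(i-1)P+1 \,:\, iP} + e_i, \qquad i = 1, \dots, N$$
where $W_p \in \mathbb{R}^{d \times P}$ is the patch projection and $e_i \in \mathbb{R}^{d}$ is the sinusoidal positional encoding.
Global token:
A learnable parameter $z_{glb} \in \mathbb{R}^{d}$ is appended, forming the endogenous token sequence $Z_{en} = [s_1, \dots, s_N, z_{glb}] \in \mathbb{R}^{(N+1) \times d}$.
Inverted exogenous embedding:
For $C$ exogenous variables, each full-length variate vector $u_j \in \mathbb{R}^{T}$ is projected:
$$v_j = W_v \, u_j, \qquad j = 1, \dots, C$$
with $W_v \in \mathbb{R}^{d \times T}$, giving the exogenous token sequence $Z_{ex} = [v_1, \dots, v_C] \in \mathbb{R}^{C \times d}$.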
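The three embeddings described above can be sketched in PyTorch as follows. This is a minimal illustration, not the library's implementation: the class name `ToyEmbeddings`, the helper `sinusoidal_encoding`, the bias settings of the linear layers, and the tensor layouts are all assumptions made for the example, and a single target series is assumed.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    """Standard sine/cosine positional table, shape (n_positions, d_model).

    Assumes d_model is even.
    """
    position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(n_positions, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table


class ToyEmbeddings(nn.Module):
    """Hypothetical sketch: patch tokens + global token, and variate tokens."""

    def __init__(self, context_length: int, patch_length: int, d_model: int):
        super().__init__()
        if context_length % patch_length != 0:
            raise ValueError("patch_length should divide context_length")
        self.patch_length = patch_length
        n_patches = context_length // patch_length
        # One linear map shared by all patches (P -> d).
        self.patch_proj = nn.Linear(patch_length, d_model, bias=False)
        # One linear map shared by all covariates (T -> d): inverted embedding.
        self.exo_proj = nn.Linear(context_length, d_model)
        # Learnable global token, broadcast over the batch in forward().
        self.global_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.register_buffer("pos", sinusoidal_encoding(n_patches, d_model))

    def forward(self, endo: torch.Tensor, exo: torch.Tensor):
        # endo: (batch, T) target series; exo: (batch, T, C) covariates.
        patches = endo.unfold(-1, self.patch_length, self.patch_length)  # (B, N, P)
        tokens = self.patch_proj(patches) + self.pos                     # (B, N, d)
        glb = self.global_token.expand(endo.size(0), -1, -1)             # (B, 1, d)
        z_en = torch.cat([tokens, glb], dim=1)                           # (B, N+1, d)
        z_ex = self.exo_proj(exo.transpose(1, 2))                        # (B, C, d)
        return z_en, z_ex


emb = ToyEmbeddings(context_length=24, patch_length=4, d_model=16)
z_en, z_ex = emb(torch.randn(2, 24), torch.randn(2, 24, 3))
print(z_en.shape, z_ex.shape)
```

With a context length of 24 and a patch length of 4, the endogenous stream holds 6 patch tokens plus the global token, while the exogenous stream holds one token per covariate regardless of the context length.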
Dual-attention encoder layer:
# Self-attention on endogenous tokens (patches + global)
Z_en = LayerNorm(Z_en + SelfAttention(Z_en, Z_en, Z_en))
# Cross-attention: the global token (last position) queries the exogenous tokens
z_glb = Z_en[-1]
z_glb = LayerNorm(z_glb + CrossAttention(Q=z_glb, K=Z_ex, V=Z_ex))
# Write the updated global token back, then apply the feed-forward network
Z_en[-1] = z_glb
Z_en = LayerNorm(Z_en + FeedForward(Z_en))
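The pseudocode above can be turned into a small runnable PyTorch module. The sketch below is an illustration under stated assumptions rather than the library's code: the class name `ToyDualAttentionLayer` is hypothetical, `nn.MultiheadAttention` stands in for whatever attention implementation the model uses, and a kernel size of 1 for the convolutions is assumed.

```python
import torch
import torch.nn as nn


class ToyDualAttentionLayer(nn.Module):
    """Hypothetical encoder layer: self-attention, global-token cross-attention, FFN."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.cross_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Feed-forward as two 1-D convolutions; kernel_size=1 is an assumption.
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)

    def forward(self, z_en: torch.Tensor, z_ex: torch.Tensor) -> torch.Tensor:
        # z_en: (B, N+1, d) patch tokens with the global token last.
        # z_ex: (B, C, d) one token per exogenous variable.
        attn_out, _ = self.self_attn(z_en, z_en, z_en)
        z_en = self.norm1(z_en + self.drop(attn_out))

        # Only the global token queries the exogenous tokens.
        z_glb = z_en[:, -1:, :]
        cross_out, _ = self.cross_attn(z_glb, z_ex, z_ex)
        z_glb = self.norm2(z_glb + self.drop(cross_out))

        # Re-inject the updated global token (no in-place write, autograd-safe).
        z_en = torch.cat([z_en[:, :-1, :], z_glb], dim=1)

        # Conv1d expects (B, channels, length), hence the transposes.
        ff = self.conv1(z_en.transpose(1, 2))
        ff = self.conv2(self.drop(self.act(ff))).transpose(1, 2)
        return self.norm3(z_en + self.drop(ff))


layer = ToyDualAttentionLayer(d_model=16, n_heads=4, d_ff=32)
out = layer(torch.randn(2, 7, 16), torch.randn(2, 3, 16))
print(out.shape)
```

The layer preserves the shape of the endogenous token sequence, so several such layers can be stacked; the exogenous tokens are passed unchanged to every layer.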
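Output projection:
$$\hat{y} = \mathrm{reshape}\big(W_o \, \mathrm{Flatten}(Z_{en})\big) \in \mathbb{R}^{H \times Q}$$
where $W_o$ maps the flattened $(N+1) \cdot d$ token features to $H \cdot Q$ outputs, $N$ is the number of patches, $H$ is the prediction length, and $Q$ is the number of quantiles (1 for point forecasts).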
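A flatten head of this kind can be sketched in a few lines of PyTorch. The class name `ToyFlattenHead` and its argument names are hypothetical, and the sketch handles a single target variable; it is meant only to make the shapes concrete.

```python
import torch
import torch.nn as nn


class ToyFlattenHead(nn.Module):
    """Hypothetical head: flatten all endogenous tokens, project to H x Q."""

    def __init__(self, n_tokens: int, d_model: int,
                 prediction_length: int, n_quantiles: int = 1):
        super().__init__()
        self.prediction_length = prediction_length
        self.n_quantiles = n_quantiles
        self.flatten = nn.Flatten(start_dim=-2)
        self.linear = nn.Linear(n_tokens * d_model, prediction_length * n_quantiles)

    def forward(self, z_en: torch.Tensor) -> torch.Tensor:
        # z_en: (B, N+1, d) -> (B, (N+1)*d) -> (B, H*Q) -> (B, H, Q)
        out = self.linear(self.flatten(z_en))
        return out.view(-1, self.prediction_length, self.n_quantiles)


# 6 patches + 1 global token, 12-step horizon, 3 quantiles.
head = ToyFlattenHead(n_tokens=7, d_model=16, prediction_length=12, n_quantiles=3)
y_hat = head(torch.randn(2, 7, 16))
print(y_hat.shape)
```

Setting `n_quantiles=1` gives a point forecast with a trailing dimension of size 1.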
Key hyperparameters:
- hidden_size (d) -- embedding dimension (typical: 256-512)
- n_heads -- attention heads (typical: 4-8, must divide hidden_size)
- e_layers -- number of encoder layers (typical: 2-6)
- patch_length (P) -- patch size for endogenous tokenization (typical: 4-24)
- d_ff -- feed-forward hidden dimension (typical: 1024-2048)
- dropout -- regularization (typical: 0.1-0.2)
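The constraints stated for these hyperparameters (n_heads dividing hidden_size, patch_length dividing the context length) can be checked before a model is built. The dataclass below is a hypothetical helper written for this page, not part of any library; its field names mirror the list above and its defaults are picked from the typical ranges given there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTimeXerConfig:
    """Hypothetical container that validates the hyperparameters listed above."""

    context_length: int
    prediction_length: int
    hidden_size: int = 256
    n_heads: int = 4
    e_layers: int = 2
    patch_length: int = 16
    d_ff: int = 1024
    dropout: float = 0.1

    def __post_init__(self) -> None:
        if self.hidden_size % self.n_heads != 0:
            raise ValueError("n_heads must divide hidden_size")
        if self.context_length % self.patch_length != 0:
            raise ValueError("patch_length should divide context_length")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def n_endogenous_tokens(self) -> int:
        # Number of patches plus the single global token.
        return self.context_length // self.patch_length + 1


cfg = ToyTimeXerConfig(context_length=96, prediction_length=24)
print(cfg.n_endogenous_tokens)  # 96 // 16 patches + 1 global token = 7
```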