Principle:Sktime Pytorch forecasting KAN Architecture

Knowledge Sources	KAN: Kolmogorov-Arnold Networks; Zero Shot Forecasting Using KAN; pytorch-forecasting
Domains	Time_Series, Forecasting, Deep_Learning, Approximation_Theory
Last Updated	2026-02-08 09:00 GMT

Overview

The Kolmogorov-Arnold Network (KAN) architecture replaces the fixed activation functions of an MLP with learnable univariate B-spline functions on the network edges. Integrated into the N-BEATS framework, it yields the NBeatsKAN variant for time series forecasting.

Description

The KAN (Kolmogorov-Arnold Network) architecture is grounded in the Kolmogorov-Arnold representation theorem, which states that any continuous multivariate function on a bounded domain can be represented as a finite superposition (sums and compositions) of continuous univariate functions. In a standard MLP, each edge carries a scalar weight and activation functions are fixed at nodes. In a KAN layer, each edge instead carries a learnable univariate function parameterized as a B-spline, while nodes perform simple summation. This design enables the network to learn complex nonlinear transformations directly on its edges.

Each KANLayer maintains a grid of B-spline knot points and a set of trainable B-spline coefficients. For an input dimension in_dim and output dimension out_dim, the layer has in_dim × out_dim independent spline functions. The forward pass computes two components for each input-output pair: (1) a residual (base) function $b (x)$ (default: SiLU activation) scaled by a learnable scale_base parameter, and (2) a spline function evaluated from the B-spline coefficients scaled by a learnable scale_sp parameter. A binary mask can enforce sparse connectivity between inputs and outputs.

The B-spline evaluation uses the recursive Cox-de Boor formula implemented in b_batch. The coef2curve function converts stored coefficients to spline curves via Einstein summation over B-spline basis functions. The curve2coef function fits coefficients to observed data via batched least squares; it is used during grid updates to re-fit the coefficients once the knot positions have been adapted to the data distribution.

Grid updating is a distinctive feature of KAN: periodically during training (controlled by the GridUpdateCallback), the B-spline grids are refined based on the actual input data distribution. The grid points are repositioned using a blend of uniform spacing and adaptive (percentile-based) spacing, controlled by the grid_eps parameter (1.0 = fully uniform, 0.0 = fully adaptive). After repositioning, the spline coefficients are re-fitted to preserve the learned function shape.
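The blending step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the library's code: the function name blended_grid is hypothetical, evenly spaced order statistics stand in for the percentile-based adaptive grid, and the re-fitting of coefficients is left out.

```python
def blended_grid(samples, num_intervals, grid_eps):
    """Return num_intervals + 1 knots blending a uniform and an adaptive grid.

    grid_eps = 1.0 gives the uniform grid, grid_eps = 0.0 the adaptive one.
    """
    xs = sorted(samples)
    n = len(xs)
    # Adaptive knots: evenly spaced order statistics, so each interval
    # holds roughly the same number of samples (assumed percentile rule).
    adaptive = [xs[(n - 1) * i // num_intervals] for i in range(num_intervals + 1)]
    # Uniform knots: equal-width intervals over the observed range.
    lo, hi = xs[0], xs[-1]
    step = (hi - lo) / num_intervals
    uniform = [lo + step * i for i in range(num_intervals + 1)]
    # Convex blend of the two grids, knot by knot.
    return [grid_eps * u + (1.0 - grid_eps) * a for u, a in zip(uniform, adaptive)]
```

With skewed samples such as [0.0, 1.0, 2.0, 3.0, 100.0] and four intervals, the adaptive grid clusters its knots near the small values while the uniform grid spreads them evenly up to 100; intermediate grid_eps values move each knot along the line between the two.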

In the NBeatsKAN model, KAN layers replace the standard MLP layers in N-BEATS blocks. The model supports trend, seasonality, and generic block types, each using KAN layers for the basis function computation. The doubly-residual architecture of N-BEATS is preserved: each block produces a backcast and forecast, with residuals propagated through the stack. The KAN variant offers improved interpretability (each learned edge function can be visualized) and potential parameter efficiency.
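The doubly-residual flow can be sketched as follows. The names doubly_residual_forward and make_mean_block are hypothetical, and the toy block below is only a stand-in: in NBeatsKAN each block would compute its backcast and forecast with KAN layers.

```python
def doubly_residual_forward(blocks, history, horizon):
    """Run a stack of blocks; each block maps a residual to (backcast, forecast)."""
    residual = list(history)
    forecast = [0.0] * horizon
    for block in blocks:
        backcast, block_forecast = block(residual)
        # First residual path: remove what this block explained from the input.
        residual = [r - b for r, b in zip(residual, backcast)]
        # Second residual path: accumulate the partial forecasts.
        forecast = [f + bf for f, bf in zip(forecast, block_forecast)]
    return residual, forecast


def make_mean_block(horizon):
    """Toy stand-in block (assumption): explains only the mean level of its input."""
    def block(residual):
        level = sum(residual) / len(residual)
        return [level] * len(residual), [level] * horizon
    return block
```

For a history of [1.0, 2.0, 3.0, 4.0] and two such blocks, the first block removes the level 2.5 and forecasts it, and the second block sees a zero-mean residual and contributes nothing, which shows how later blocks only model what earlier blocks left unexplained.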

Usage

Use the KAN Architecture (NBeatsKAN) for univariate time series forecasting when: (1) interpretability matters, since individual edge functions can be inspected; (2) the data contains complex nonlinear patterns that benefit from learnable activation functions; or (3) parameter efficiency is important, since KAN can achieve comparable accuracy with fewer parameters. Add the GridUpdateCallback to the trainer with an appropriate update_interval to refine the spline grids periodically during training. The default block configuration uses trend and seasonality stacks with B-spline order k=3 and G=5 grid intervals.

Theoretical Basis

Kolmogorov-Arnold representation theorem: Any continuous function $f : [0, 1]^{n} \to ℝ$ can be written as:

$f (x_{1}, \dots, x_{n}) = \sum_{q = 0}^{2 n} Φ_{q} (\sum_{p = 1}^{n} ϕ_{q, p} (x_{p}))$

where $ϕ_{q, p}$ and $Φ_{q}$ are continuous univariate functions.
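
As a small illustration (a hand-picked example, not the construction used in the proof of the theorem), the product $f (x_{1}, x_{2}) = x_{1} x_{2}$ with $n = 2$ already has this form using only two nonzero outer functions:

$x_{1} x_{2} = Φ_{0} (x_{1} + x_{2}) + Φ_{1} (x_{1} - x_{2}), \quad Φ_{0} (u) = \frac{u^{2}}{4}, \quad Φ_{1} (u) = - \frac{u^{2}}{4}$

Here $ϕ_{0, 1} (x) = ϕ_{0, 2} (x) = ϕ_{1, 1} (x) = x$, $ϕ_{1, 2} (x) = - x$, and $Φ_{q} = 0$ for $q = 2, 3, 4$. Expanding the squares gives $\frac{(x_{1} + x_{2})^{2} - (x_{1} - x_{2})^{2}}{4} = x_{1} x_{2}$.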

KAN layer forward pass: For input $x \in ℝ^{d_{in}}$, the output $y \in ℝ^{d_{out}}$ is:

$y_{j} = \sum_{i = 1}^{d_{in}} m_{i j} (s_{i j}^{base} \cdot b (x_{i}) + s_{i j}^{sp} \cdot {spline}_{i j} (x_{i}))$

where $m_{i j}$ is the connectivity mask, $s_{i j}^{base}$ and $s_{i j}^{sp}$ are learnable scale parameters, $b (\cdot)$ is the base function (SiLU), and ${spline}_{i j}$ is the B-spline function for edge $(i, j)$.
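
The forward-pass formula can be written out directly as a double loop. This is a minimal, unbatched sketch for a single input vector, assuming each edge spline is supplied as a plain Python callable; the function name kan_layer_forward and its argument layout (nested lists indexed as [i][j]) are hypothetical and chosen only to mirror the indices in the formula.

```python
import math


def silu(x):
    """SiLU base function b(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def kan_layer_forward(x, splines, scale_base, scale_sp, mask):
    """Compute y_j = sum_i mask[i][j] * (scale_base[i][j] * b(x_i)
    + scale_sp[i][j] * splines[i][j](x_i)) for one input vector x."""
    d_in, d_out = len(x), len(mask[0])
    y = [0.0] * d_out
    for j in range(d_out):
        for i in range(d_in):
            # Each edge (i, j) applies its own univariate function to x_i.
            edge = scale_base[i][j] * silu(x[i]) + scale_sp[i][j] * splines[i][j](x[i])
            # The node only sums the (masked) edge outputs.
            y[j] += mask[i][j] * edge
    return y
```

Setting an entry of mask to 0 removes that edge from the sum, which is how sparse connectivity enters the computation.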

B-spline basis (Cox-de Boor recursion):

$B_{i, 0} (x) = \begin{cases} 1 & \text{if } t_{i} \leq x < t_{i + 1} \\ 0 & \text{otherwise} \end{cases}$

$B_{i, k} (x) = \frac{x - t_{i}}{t_{i + k} - t_{i}} B_{i, k - 1} (x) + \frac{t_{i + k + 1} - x}{t_{i + k + 1} - t_{i + 1}} B_{i + 1, k - 1} (x)$
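
The recursion translates almost literally into code. The sketch below is a scalar, unbatched version for illustration (the function name bspline_basis is hypothetical); it also adopts the common convention that a term with a zero denominator, which arises with repeated knots, is treated as zero.

```python
def bspline_basis(i, k, x, knots):
    """Evaluate B_{i,k}(x) on the knot vector `knots` by Cox-de Boor recursion."""
    if k == 0:
        # Degree 0: indicator of the half-open interval [t_i, t_{i+1}).
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    # Convention (assumption): a 0/0 term from repeated knots counts as 0.
    left = 0.0 if left_den == 0 else (
        (x - knots[i]) / left_den * bspline_basis(i, k - 1, x, knots)
    )
    right = 0.0 if right_den == 0 else (
        (knots[i + k + 1] - x) / right_den * bspline_basis(i + 1, k - 1, x, knots)
    )
    return left + right
```

On a knot vector with enough knots on both sides, the basis functions of a given degree sum to 1 at every interior point, which is a convenient sanity check for any implementation.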

Spline curve evaluation:

${spline}_{i j} (x) = \sum_{m} c_{i j, m} \cdot B_{m, k} (x)$

where $c_{i j, m}$ are the trainable B-spline coefficients and $B_{m, k}$ are the basis functions of order $k$ .
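
A batched version of this sum can be sketched with NumPy, in the spirit of the Einstein-summation approach described for coef2curve. The names basis_matrix and edge_spline_values, the array layouts, and the use of one knot vector shared by all inputs are assumptions made for this illustration; strictly increasing knots are assumed so that no denominator is zero.

```python
import numpy as np


def basis_matrix(x, knots, k):
    """Return B with B[n, m] = B_{m,k}(x[n]); knots must be strictly increasing."""
    x = np.asarray(x, dtype=float)[:, None]        # shape (N, 1)
    t = np.asarray(knots, dtype=float)[None, :]    # shape (1, T)
    # Degree 0: indicator of [t_m, t_{m+1}), shape (N, T - 1).
    B = ((x >= t[:, :-1]) & (x < t[:, 1:])).astype(float)
    for d in range(1, k + 1):
        # Vectorized Cox-de Boor step from degree d - 1 to degree d.
        left = (x - t[:, :-(d + 1)]) / (t[:, d:-1] - t[:, :-(d + 1)])
        right = (t[:, d + 1:] - x) / (t[:, d + 1:] - t[:, 1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B                                        # shape (N, T - k - 1)


def edge_spline_values(x, knots, coef, k):
    """x: (N, d_in), coef: (d_in, d_out, M). Returns spline_ij(x_ni), shape (N, d_in, d_out)."""
    x = np.asarray(x, dtype=float)
    # One basis matrix per input dimension, stacked to shape (N, d_in, M).
    B = np.stack([basis_matrix(x[:, i], knots, k) for i in range(x.shape[1])], axis=1)
    # Sum over the basis index m for every edge (i, o).
    return np.einsum("nim,iom->nio", B, coef)
```

For a grid with G = 5 intervals on [0, 1] and k = 3, extending the knot vector by k knots on each side gives 12 knots and M = G + k = 8 coefficients per edge.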

Grid adaptation: At update intervals, the grid knots are repositioned as a blend of uniform and adaptive grids:

$g_{new} = ϵ \cdot g_{uniform} + (1 - ϵ) \cdot g_{adaptive}$

where $g_{adaptive}$ is derived from percentiles of the input samples and $ϵ$ is the grid_eps parameter. After repositioning, coefficients are re-estimated via least squares:

$c^{*} = \arg \min_{c} ‖ B (x) \cdot c - y (x) ‖^{2}$
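
The re-fitting step can be illustrated with NumPy's least-squares solver. To keep the sketch short it makes simplifying assumptions: it handles a single edge and uses degree-1 B-splines (hat functions centered at the knots, built with np.interp) instead of the order-k basis; the function names linear_basis and refit_coefficients are hypothetical.

```python
import numpy as np


def linear_basis(x, knots):
    """Design matrix of degree-1 B-splines: column m is the hat function at knot m."""
    eye = np.eye(len(knots))
    # Interpolating a unit vector over the knots yields exactly one hat function.
    return np.stack([np.interp(x, knots, eye[m]) for m in range(len(knots))], axis=1)


def refit_coefficients(x, old_knots, old_coef, new_knots):
    """Fit coefficients on new_knots so the curve matches the old one at samples x."""
    y = linear_basis(x, old_knots) @ old_coef      # curve values y(x) on the old grid
    B_new = linear_basis(x, new_knots)             # basis B(x) on the new grid
    # Solve argmin_c || B_new c - y ||^2.
    new_coef, *_ = np.linalg.lstsq(B_new, y, rcond=None)
    return new_coef
```

Re-fitting on an unchanged grid returns the original coefficients, while on a repositioned grid the result is the best approximation of the old curve in the least-squares sense over the sampled inputs.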

Related Pages

Implemented By
