Principle:Sktime Pytorch forecasting Residual Connection

Knowledge Sources	pytorch-forecasting Deep Residual Learning
Domains	Time_Series, Forecasting, Deep_Learning, Neural_Network_Architecture
Last Updated	2026-02-08 09:00 GMT

Overview

Residual (skip) connections in feed-forward blocks add a linear shortcut path alongside a nonlinear transformation. The shortcut eases gradient flow and supports effective training of deep forecasting networks.

Description

The Residual Connection principle, as implemented in the ResidualBlock, combines a nonlinear transformation path with a direct linear shortcut. The block contains two parallel pathways:

1. Nonlinear Path: The input is passed through an activation function (ReLU by default, configurable), then through a linear layer that maps from input size to output size, followed by dropout regularization.

2. Direct Linear Path (Skip Connection): The input is simultaneously passed through a bias-free linear projection from input size to output size. This path carries the raw signal forward without nonlinear distortion.

The outputs of both paths are summed element-wise, and optionally followed by layer normalization. This design allows the nonlinear path to learn a residual function on top of the identity-like linear mapping, which is easier to optimize than learning the full mapping from scratch.

A key detail is that the skip connection uses a linear projection without bias rather than a plain identity mapping. This accommodates cases where the input and output dimensions differ (e.g., at transitions between stages of the network), while still preserving the gradient-flow benefits of residual learning.
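The two pathways described above can be sketched as a small PyTorch module. This is a minimal illustration, not the library's source: the class name `ResidualBlockSketch`, the `dropout` and `activation` arguments, and their defaults are assumptions made for this example; only `in_size`, `out_size`, and `apply_final_norm` are names taken from this page.

```python
import torch
from torch import nn


class ResidualBlockSketch(nn.Module):
    """Illustrative residual block; argument names and defaults are assumptions."""

    def __init__(self, in_size, out_size, dropout=0.1, activation=None,
                 apply_final_norm=True):
        super().__init__()
        # Nonlinear path: activation -> linear (with bias) -> dropout
        self.activation = activation if activation is not None else nn.ReLU()
        self.linear = nn.Linear(in_size, out_size)
        self.dropout = nn.Dropout(dropout)
        # Direct path: bias-free projection, so in_size may differ from out_size
        self.skip = nn.Linear(in_size, out_size, bias=False)
        # Optional layer normalization applied to the summed output
        self.norm = nn.LayerNorm(out_size) if apply_final_norm else nn.Identity()

    def forward(self, x):
        nonlinear = self.dropout(self.linear(self.activation(x)))
        return self.norm(nonlinear + self.skip(x))


block = ResidualBlockSketch(in_size=8, out_size=4)
x = torch.randn(2, 16, 8)  # (batch, time steps, features)
y = block(x)               # last dimension is projected from 8 to 4
```

Because both linear layers act on the last dimension only, the same block works for inputs shaped `(batch, features)` or `(batch, time, features)`.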

Usage

Use ResidualBlock as the fundamental building block in deep feed-forward architectures for time series forecasting (e.g., DSIPTs). Stack multiple blocks to build deep networks while mitigating vanishing gradients. Set in_size and out_size to handle dimension changes at different stages of the network. Set apply_final_norm=False on intermediate blocks if layer normalization should be deferred to a later block.
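A hedged sketch of this stacking pattern follows. The `Block` class is a hypothetical stand-in written for this example (it is not imported from the library), and the layer sizes are arbitrary; the point is only to show dimension changes between stages and normalization deferred to the last block.

```python
import torch
from torch import nn


class Block(nn.Module):
    # Hypothetical stand-in for ResidualBlock; not the library class.
    def __init__(self, in_size, out_size, dropout=0.1, apply_final_norm=True):
        super().__init__()
        self.linear = nn.Linear(in_size, out_size)
        self.skip = nn.Linear(in_size, out_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(out_size) if apply_final_norm else nn.Identity()

    def forward(self, x):
        return self.norm(self.skip(x) + self.dropout(self.linear(torch.relu(x))))


sizes = [32, 64, 64, 16]  # feature width at each stage (arbitrary example values)
pairs = list(zip(sizes[:-1], sizes[1:]))
blocks = [
    # Only the last block applies layer normalization
    Block(d_in, d_out, apply_final_norm=(i == len(pairs) - 1))
    for i, (d_in, d_out) in enumerate(pairs)
]
net = nn.Sequential(*blocks)

out = net(torch.randn(4, 24, 32))  # (batch, time steps, features)
```

Whether to normalize after every block or only at the end is a design choice; the sketch shows the deferred variant mentioned above.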

Theoretical Basis

Standard Residual Learning:

Given an input $x$ , a residual block learns:

$y = ℱ (x) + 𝒢 (x)$

Where $ℱ (x)$ is the nonlinear transformation and $𝒢 (x)$ is the skip connection.

In this implementation:

$ℱ (x) = \mathrm{Dropout} (W_{1} \cdot σ (x) + b_{1})$

$𝒢 (x) = W_{skip} \cdot x$

$y = \mathrm{LayerNorm} (ℱ (x) + 𝒢 (x))$

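Where $σ$ is the activation function (ReLU by default), $W_{1} \in ℝ^{d_{out} \times d_{in}}$ is a weight matrix with bias vector $b_{1} \in ℝ^{d_{out}}$, and $W_{skip} \in ℝ^{d_{out} \times d_{in}}$ is a bias-free projection. The layer normalization is applied only when the final norm is enabled; otherwise $y = ℱ (x) + 𝒢 (x)$.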
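As a small worked example of these formulas, take $d_{in} = d_{out} = 2$ with ReLU, dropout inactive (as at inference time), and the final normalization disabled. All numeric values below are made up for illustration.

```latex
x = \begin{pmatrix} 1 \\ -2 \end{pmatrix}, \quad
\sigma(x) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
W_{1} = \begin{pmatrix} 2 & 3 \\ 1 & -1 \end{pmatrix}, \quad
b_{1} = \begin{pmatrix} 0.5 \\ 0 \end{pmatrix}, \quad
W_{skip} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}

\mathcal{F}(x) = W_{1}\,\sigma(x) + b_{1}
  = \begin{pmatrix} 2 \\ 1 \end{pmatrix} + \begin{pmatrix} 0.5 \\ 0 \end{pmatrix}
  = \begin{pmatrix} 2.5 \\ 1 \end{pmatrix}

\mathcal{G}(x) = W_{skip}\,x = \begin{pmatrix} 1 - 2 \\ 0 - 2 \end{pmatrix}
  = \begin{pmatrix} -1 \\ -2 \end{pmatrix}

y = \mathcal{F}(x) + \mathcal{G}(x) = \begin{pmatrix} 1.5 \\ -1 \end{pmatrix}
```

The negative input component is zeroed by ReLU on the nonlinear path, but it still reaches the output through the skip projection.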

Gradient flow benefit:

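During backpropagation, the Jacobian of the block output with respect to its input (taken before the optional layer normalization, and with dropout inactive) is:

$\frac{\partial y}{\partial x} = \frac{\partial ℱ (x)}{\partial x} + W_{skip}$

The $W_{skip}$ term gives gradients a direct linear path back through the block that does not pass through the activation function. Even where the activation's derivative is zero, this term remains, which helps mitigate vanishing gradients in very deep architectures. When layer normalization is enabled, the expression above is additionally multiplied by the Jacobian of the normalization.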
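The gradient expression can be checked numerically with autograd. The sketch below is an illustration under stated assumptions: dropout and the final layer normalization are left out, the activation is ReLU, and the sizes are arbitrary. Under those assumptions the Jacobian should equal $W_{1} \cdot \mathrm{diag}(σ'(x)) + W_{skip}$.

```python
import torch
from torch import nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d_in, d_out = 5, 3  # arbitrary example sizes

# Double precision keeps the numerical comparison tight
linear = nn.Linear(d_in, d_out).double()           # nonlinear-path weights W_1, b_1
skip = nn.Linear(d_in, d_out, bias=False).double()  # bias-free W_skip


def pre_norm_block(x):
    # Dropout and layer normalization are intentionally omitted here
    return linear(torch.relu(x)) + skip(x)


x = torch.randn(d_in, dtype=torch.double)
J = jacobian(pre_norm_block, x)  # shape (d_out, d_in)

# Analytic form: W_1 * diag(relu'(x)) + W_skip
relu_grad = (x > 0).double()
expected = (linear.weight * relu_grad + skip.weight).detach()

match = torch.allclose(J, expected)

# Columns where ReLU is inactive still carry the W_skip gradient
inactive = x <= 0
skip_only = torch.allclose(J[:, inactive], skip.weight.detach()[:, inactive])
```

For input components where ReLU is inactive, the nonlinear path contributes nothing to the Jacobian, so the corresponding columns reduce to those of `skip.weight`.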

Pseudo-code:

# ResidualBlock forward pass (pseudo-code)
def residual_block(x):
    skip = linear_no_bias(x)           # direct path
    out = dropout(linear(activation(x)))  # nonlinear path
    result = skip + out
    if apply_final_norm:
        result = layer_norm(result)
    return result

Related Pages

Implemented By

Implementation:Sktime_Pytorch_forecasting_ResidualBlock
