Principle:Sktime Pytorch forecasting Residual Connection

From Leeroopedia


Knowledge Sources
Domains Time_Series, Forecasting, Deep_Learning, Neural_Network_Architecture
Last Updated 2026-02-08 09:00 GMT

Overview

Residual (skip) connections add a linear shortcut path alongside the nonlinear transformation in a feed-forward block, which eases gradient flow and makes deep forecasting networks easier to train.

Description

The Residual Connection principle, as implemented in the ResidualBlock, combines a nonlinear transformation path with a direct linear shortcut. The block contains two parallel pathways:

1. Nonlinear Path: The input is passed through an activation function (ReLU by default, configurable), then through a linear layer that maps from input size to output size, followed by dropout regularization.

2. Direct Linear Path (Skip Connection): The input is simultaneously passed through a bias-free linear projection from input size to output size. This path carries the raw signal forward without nonlinear distortion.

The outputs of both paths are summed element-wise, and optionally followed by layer normalization. This design allows the nonlinear path to learn a residual function on top of the identity-like linear mapping, which is easier to optimize than learning the full mapping from scratch.

A key detail is that the skip connection uses a linear projection without bias rather than a plain identity mapping. This accommodates cases where the input and output dimensions differ (e.g., at transitions between stages of the network), while still preserving the gradient-flow benefits of residual learning.
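The two pathways described above can be sketched in PyTorch as follows. This is a minimal illustration, not the library's ResidualBlock: the class name SketchResidualBlock, the dropout argument, and passing apply_final_norm to forward are assumptions made for this example.

```python
import torch
from torch import nn


class SketchResidualBlock(nn.Module):
    """Illustrative two-path block; not the library's ResidualBlock."""

    def __init__(self, in_size, out_size, dropout=0.1):
        super().__init__()
        self.activation = nn.ReLU()                            # default activation
        self.linear = nn.Linear(in_size, out_size)             # nonlinear path (has bias)
        self.dropout = nn.Dropout(dropout)
        self.skip = nn.Linear(in_size, out_size, bias=False)   # bias-free shortcut
        self.norm = nn.LayerNorm(out_size)

    def forward(self, x, apply_final_norm=True):
        # Nonlinear path: activation -> linear -> dropout
        nonlinear = self.dropout(self.linear(self.activation(x)))
        # Element-wise sum with the linear shortcut
        summed = self.skip(x) + nonlinear
        return self.norm(summed) if apply_final_norm else summed


torch.manual_seed(0)
block = SketchResidualBlock(in_size=8, out_size=4)
x = torch.randn(2, 16, 8)   # (batch, time steps, features)
y = block(x)                # feature dimension projected from 8 to 4
```

Because the shortcut is a projection rather than an identity, the block accepts in_size != out_size without any extra reshaping.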

Usage

Use ResidualBlock as the basic building block in deep feed-forward architectures for time series forecasting (e.g., DSIPTs). Stack multiple blocks to build deeper networks while limiting vanishing-gradient problems. Set in_size and out_size to handle dimension changes between stages of the network. Pass apply_final_norm=False for intermediate blocks if layer normalization should be deferred.
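One way to picture this stacking pattern is the sketch below. It assumes a simplified stand-in class (SketchResidualBlock), hypothetical stage widths, and a hypothetical helper run_stack that defers layer normalization to the last block; the real ResidualBlock signature may differ.

```python
import torch
from torch import nn


class SketchResidualBlock(nn.Module):
    """Simplified stand-in for a residual block (assumed interface)."""

    def __init__(self, in_size, out_size, dropout=0.1):
        super().__init__()
        self.linear = nn.Linear(in_size, out_size)
        self.skip = nn.Linear(in_size, out_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(out_size)

    def forward(self, x, apply_final_norm=True):
        out = self.skip(x) + self.dropout(self.linear(torch.relu(x)))
        return self.norm(out) if apply_final_norm else out


def run_stack(blocks, x):
    """Apply blocks in order, normalizing only after the last one."""
    last = len(blocks) - 1
    for i, block in enumerate(blocks):
        x = block(x, apply_final_norm=(i == last))
    return x


torch.manual_seed(0)
sizes = [16, 64, 64, 8]   # hypothetical stage widths
blocks = nn.ModuleList(
    [SketchResidualBlock(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
)
blocks.eval()             # disable dropout for a deterministic pass
with torch.no_grad():
    out = run_stack(blocks, torch.randn(4, 16))
```

Each consecutive pair in sizes becomes the in_size and out_size of one block, so dimension changes are handled by the shortcut projection of the block where they occur.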

Theoretical Basis

Standard Residual Learning:

Given an input x, a residual block learns:

$$y = \mathcal{F}(x) + \mathcal{G}(x)$$

where $\mathcal{F}(x)$ is the nonlinear transformation and $\mathcal{G}(x)$ is the skip connection.

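In this implementation:

$$\mathcal{F}(x) = \mathrm{Dropout}(W_1 \sigma(x) + b_1)$$

$$\mathcal{G}(x) = W_{\text{skip}} x$$

$$y = \mathrm{LayerNorm}(\mathcal{F}(x) + \mathcal{G}(x))$$

where $\sigma$ is the activation function (ReLU by default), $W_1 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is a weight matrix with an accompanying bias $b_1$, and $W_{\text{skip}} \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is a bias-free projection. The layer normalization is applied only when the final norm is enabled.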

Gradient flow benefit:

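During backpropagation, the Jacobian of the block output with respect to its input (taken before the optional layer normalization) is:

$$\frac{\partial y}{\partial x} = \frac{\partial \mathcal{F}(x)}{\partial x} + W_{\text{skip}}$$

The $W_{\text{skip}}$ term gives gradients a direct path back through the block that bypasses the activation and dropout, which helps mitigate vanishing gradients in deep architectures.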
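The gradient decomposition can be checked numerically with autograd. The sketch below is an illustration under simplifying assumptions: dropout and layer normalization are left out, the weights are random tensors rather than trained parameters, and all names are hypothetical.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d_in, d_out = 3, 2
W1 = torch.randn(d_out, d_in)       # weight of the nonlinear path
b1 = torch.randn(d_out)             # bias of the nonlinear path
W_skip = torch.randn(d_out, d_in)   # bias-free shortcut projection


def nonlinear_path(x):
    # F(x) without dropout (as in evaluation mode)
    return W1 @ torch.relu(x) + b1


def block(x):
    # y = F(x) + G(x), without the optional layer normalization
    return nonlinear_path(x) + W_skip @ x


x = torch.randn(d_in)
J_block = jacobian(block, x)                # dy/dx, shape (d_out, d_in)
J_nonlinear = jacobian(nonlinear_path, x)   # dF/dx, shape (d_out, d_in)

# With all-negative inputs the ReLU path contributes no gradient,
# so the shortcut is the only route left for the gradient.
x_dead = -torch.ones(d_in)
J_dead = jacobian(block, x_dead)
```

In this setup J_block equals J_nonlinear + W_skip, and J_dead reduces to W_skip alone, which is the direct gradient path described above.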

Pseudo-code:

# ResidualBlock forward pass (pseudo-code)
def residual_block(x, apply_final_norm=True):
    skip = linear_no_bias(x)           # direct path
    out = dropout(linear(activation(x)))  # nonlinear path
    result = skip + out
    if apply_final_norm:
        result = layer_norm(result)
    return result

Related Pages

Implemented By
