Principle:Sktime Pytorch forecasting Residual Connection
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning, Neural_Network_Architecture |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
Residual (skip) connections in feed-forward blocks add a linear shortcut path alongside a nonlinear transformation, which eases gradient flow and makes deep forecasting networks easier to train.
Description
The Residual Connection principle, as implemented in the ResidualBlock, combines a nonlinear transformation path with a direct linear shortcut. The block contains two parallel pathways:
1. Nonlinear Path: The input is passed through an activation function (ReLU by default, configurable), then through a linear layer that maps from input size to output size, followed by dropout regularization.
2. Direct Linear Path (Skip Connection): The input is simultaneously passed through a bias-free linear projection from input size to output size. This path carries the raw signal forward without nonlinear distortion.
The outputs of both paths are summed element-wise, and optionally followed by layer normalization. This design allows the nonlinear path to learn a residual function on top of the identity-like linear mapping, which is easier to optimize than learning the full mapping from scratch.
A key detail is that the skip connection uses a linear projection without bias rather than a plain identity mapping. This accommodates cases where the input and output dimensions differ (e.g., at transitions between stages of the network), while still preserving the gradient-flow benefits of residual learning.
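To make the two pathways concrete, here is a minimal functional sketch in PyTorch. It is not the library's implementation: the sizes (8 to 4), the dropout rate of 0.1, and the variable names `w_nonlinear`, `b_nonlinear`, and `w_skip` are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumed, not taken from the library).
in_size, out_size, batch = 8, 4, 3
x = torch.randn(batch, in_size)

# Parameters of the two paths, randomly initialised for the example.
w_nonlinear = torch.randn(out_size, in_size)  # linear layer on the nonlinear path
b_nonlinear = torch.zeros(out_size)           # this layer carries a bias
w_skip = torch.randn(out_size, in_size)       # skip projection, deliberately bias-free

# 1. Nonlinear path: activation -> linear -> dropout.
nonlinear = F.dropout(
    F.linear(F.relu(x), w_nonlinear, b_nonlinear), p=0.1, training=True
)

# 2. Skip path: bias-free linear projection of the raw input.
skip = F.linear(x, w_skip)

# Element-wise sum, then the optional layer normalisation over the feature axis.
summed = skip + nonlinear
normed = F.layer_norm(summed, normalized_shape=(out_size,))

print(summed.shape, normed.shape)
```

Because the skip path is a projection rather than an identity, the sum is well defined even though the input has 8 features and the output has 4.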
Usage
Use ResidualBlock as the fundamental building block in deep feed-forward architectures for time series forecasting (e.g., DSIPTs). Stack multiple blocks to build deep networks while limiting the risk of vanishing gradients. Set in_size and out_size to handle dimension changes at different stages of the network. Pass apply_final_norm=False to intermediate blocks if layer normalization should be deferred to a later block.
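The stacking pattern described above might look like the following sketch. `ResidualBlockSketch` is a hypothetical stand-in written for this example, not the library's class. The parameter names `in_size`, `out_size`, and `apply_final_norm` follow the text, while `dropout`, `activation`, and the layer widths are assumptions.

```python
import torch
from torch import nn


class ResidualBlockSketch(nn.Module):
    """Illustrative stand-in for ResidualBlock; not the library's actual class."""

    def __init__(self, in_size, out_size, dropout=0.1, activation=None,
                 apply_final_norm=True):
        super().__init__()
        self.activation = activation if activation is not None else nn.ReLU()
        self.linear = nn.Linear(in_size, out_size)            # nonlinear path, with bias
        self.dropout = nn.Dropout(dropout)
        self.skip = nn.Linear(in_size, out_size, bias=False)  # bias-free shortcut
        # Identity stands in for "no normalisation" when it is deferred.
        self.norm = nn.LayerNorm(out_size) if apply_final_norm else nn.Identity()

    def forward(self, x):
        out = self.dropout(self.linear(self.activation(x)))
        return self.norm(self.skip(x) + out)


# Hypothetical three-stage stack with widths 16 -> 32 -> 32 -> 8.
# Layer normalisation is deferred to the last block.
model = nn.Sequential(
    ResidualBlockSketch(16, 32, apply_final_norm=False),
    ResidualBlockSketch(32, 32, apply_final_norm=False),
    ResidualBlockSketch(32, 8, apply_final_norm=True),
)

x = torch.randn(4, 16)  # (batch, features)
y = model(x)
print(y.shape)  # torch.Size([4, 8])
```

Using `nn.Identity` for the disabled normalization keeps `forward` free of branching; whether the library does the same is not confirmed by the source.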
Theoretical Basis
Standard Residual Learning:
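Given an input $x$, a residual block learns:
$$y = F(x) + S(x)$$
Where $F(x)$ is the nonlinear transformation and $S(x)$ is the skip connection.
In this implementation:
$$y = \mathrm{Dropout}\big(W_1\,\sigma(x) + b_1\big) + W_2\,x$$
Where $\sigma$ is the activation function (ReLU by default), $W_1$ is a weight matrix with bias $b_1$, and $W_2$ is a bias-free projection. When layer normalization is enabled, it is applied to $y$.
Gradient flow benefit:
During backpropagation, the gradient through the residual block (before the optional layer normalization) is:
$$\frac{\partial y}{\partial x} = \frac{\partial F}{\partial x} + W_2$$
The term $W_2$ gives gradients a direct path back through the network that does not depend on the activation, which mitigates vanishing gradients even in very deep architectures.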
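The gradient decomposition can be checked numerically with autograd. The sketch below assumes the notation $F$ for the nonlinear path and $W_2$ for the bias-free skip weight. It omits dropout (as in evaluation mode) and layer normalization, and uses arbitrary small sizes. All names are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

in_size, out_size = 5, 3               # arbitrary sizes for the check
w1 = torch.randn(out_size, in_size)    # nonlinear-path weight
b1 = torch.randn(out_size)             # nonlinear-path bias
w2 = torch.randn(out_size, in_size)    # bias-free skip projection


def nonlinear_path(x):
    # F(x) = W1 relu(x) + b1; dropout is omitted, as in evaluation mode.
    return w1 @ torch.relu(x) + b1


def block(x):
    # y = F(x) + W2 x, without the optional layer normalisation.
    return nonlinear_path(x) + w2 @ x


x = torch.randn(in_size)
jac_block = jacobian(block, x)                # dy/dx, shape (out_size, in_size)
jac_nonlinear = jacobian(nonlinear_path, x)   # dF/dx

# The two Jacobians differ by the skip weight, up to floating-point error.
skip_term = jac_block - jac_nonlinear
print(torch.allclose(skip_term, w2, atol=1e-5))

# Where every ReLU unit is inactive, dF/dx is zero,
# yet the block's gradient still equals W2.
x_neg = -torch.ones(in_size)
print(torch.count_nonzero(jacobian(nonlinear_path, x_neg)).item())
print(torch.allclose(jacobian(block, x_neg), w2))
```

The second check illustrates the point of the shortcut: even when the activation blocks all gradient on the nonlinear path, the skip projection still passes a signal backward.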
Pseudo-code:
# ResidualBlock forward pass (pseudo-code)
def residual_block(x):
    skip = linear_no_bias(x)               # direct path
    out = dropout(linear(activation(x)))   # nonlinear path
    result = skip + out
    if apply_final_norm:
        result = layer_norm(result)
    return result