Principle: Fastai Fastbook Activation Functions
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Neural Network Architecture, Approximation Theory |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
An activation function is a non-linear function applied element-wise between linear layers in a neural network, enabling the network to learn non-linear relationships that a single linear transformation cannot represent.
Description
A linear layer computes y = x @ W + b. Stacking multiple linear layers without non-linearity in between is mathematically equivalent to a single linear layer (since the composition of linear functions is linear). The activation function breaks this linearity, allowing each layer to learn a distinct transformation.
The Rectified Linear Unit (ReLU) is the most widely used activation function in modern deep learning. Defined as relu(x) = max(x, 0), it simply replaces negative values with zero. Despite its simplicity, ReLU combined with linear layers satisfies the Universal Approximation Theorem: a network with at least one hidden layer and a non-linear activation can approximate any continuous function on a compact set to arbitrary accuracy, given sufficient hidden units.
Usage
Use non-linear activation functions whenever:
- You are building a neural network with more than one layer and need it to learn non-linear patterns.
- You want to move beyond a simple linear classifier to a model capable of capturing complex feature interactions.
- You need to choose between activation functions: ReLU is the default choice for hidden layers due to its computational efficiency and good gradient properties.
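As a minimal sketch, ReLU can be implemented directly from its definition with `np.maximum` (the function name `relu` here is illustrative, not from a specific library):

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0): negative entries become exactly zero,
    # positive entries pass through unchanged.
    return np.maximum(x, 0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negatives and zero map to 0.0; 1.5 and 3.0 are unchanged
```

In practice you would use a framework's built-in version (e.g. `F.relu` in PyTorch), but the operation is exactly this one-liner.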
Theoretical Basis
Why Non-linearity Is Required
Consider two successive linear layers:
```
y1 = x @ W1 + b1
y2 = y1 @ W2 + b2
   = (x @ W1 + b1) @ W2 + b2
   = x @ (W1 @ W2) + (b1 @ W2 + b2)
   = x @ W_combined + b_combined
```
This shows that two linear layers collapse into one. No matter how many linear layers are stacked, the result is always a single linear transformation. Non-linearity between layers prevents this collapse.
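The collapse can be verified numerically: with arbitrary weights (random shapes chosen here purely for illustration), the output of two stacked linear layers matches a single layer built from the combined parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))                 # batch of 4 inputs with 5 features
W1, b1 = rng.normal(size=(5, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Two stacked linear layers, no activation in between.
y2 = (x @ W1 + b1) @ W2 + b2

# One linear layer with the algebraically combined parameters.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
y_single = x @ W_combined + b_combined

print(np.allclose(y2, y_single))  # True: the two layers collapsed into one
```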
ReLU Definition
```
relu(x) = max(x, 0) = { x  if x > 0
                      { 0  if x <= 0
```
Properties:
- Computationally efficient: Only a comparison and selection operation.
- Sparse activation: Produces exact zeros for negative inputs, leading to sparse representations.
- Non-vanishing gradient (for positive inputs): The gradient is exactly 1 for positive inputs, avoiding the vanishing gradient problem that plagues sigmoid and tanh in deep networks.
- Derivative:
relu'(x) = 1 if x > 0, else 0.
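The derivative is just as cheap to compute as the function itself; a sketch (the gradient at exactly x = 0 is taken as 0 by convention, which is what the boolean test below produces):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def relu_grad(x):
    # 1.0 where x > 0, else 0.0; at x == 0 the comparison is False, giving 0.
    return (x > 0).astype(float)

x = np.array([-1.0, 0.0, 2.0])
print(relu_grad(x))  # gradient is 0 for the non-positive inputs, 1 for 2.0
```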
The Universal Approximation Theorem
The theorem (Cybenko 1989, Hornik 1991) states that a feedforward network with:
- At least one hidden layer
- A non-linear activation function (such as ReLU)
- Sufficiently many hidden units
can approximate any continuous function f: R^n -> R^m on a compact domain to any desired degree of accuracy. Intuitively, with ReLU, the network creates a piecewise-linear approximation: more hidden units mean shorter line segments, producing a closer fit to any target curve.
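A concrete instance of this piecewise-linear construction: the absolute-value function is represented exactly by a hidden layer of just two ReLU units, since |x| = relu(x) + relu(-x). This is a hand-built illustration, not a trained network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# |x| = relu(x) + relu(-x): one ReLU contributes the right-hand ramp,
# the other the left-hand ramp. More units would allow more segments,
# hence finer approximations of an arbitrary continuous target.
x = np.linspace(-3, 3, 13)
approx = relu(x) + relu(-x)

print(np.allclose(approx, np.abs(x)))  # True: exact piecewise-linear match
```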
Neural Network as Layer Composition
A two-layer neural network with ReLU is:
```
hidden = relu(input @ W1 + b1)  # first linear layer + activation
output = hidden @ W2 + b2       # second linear layer
```
The first layer projects the input into a different-dimensional space (e.g., 784 -> 30) and applies ReLU to create non-linear features; the second layer combines these features into the final prediction.
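The whole two-layer forward pass can be sketched in numpy. The 784 -> 30 shapes follow the example above; the batch size of 64 and the 10 outputs (e.g. digit classes, as in MNIST) are assumptions for illustration, as is the small weight scale:

```python
import numpy as np

rng = np.random.default_rng(42)
batch = rng.normal(size=(64, 784))           # e.g. 64 flattened 28x28 images

# Randomly initialized parameters (small scale, assumed for the sketch).
W1, b1 = rng.normal(size=(784, 30)) * 0.01, np.zeros(30)
W2, b2 = rng.normal(size=(30, 10)) * 0.01, np.zeros(10)

hidden = np.maximum(batch @ W1 + b1, 0)      # first linear layer + ReLU
output = hidden @ W2 + b2                    # second linear layer

print(hidden.shape, output.shape)  # (64, 30) (64, 10)
```

Note that `hidden` contains no negative values, which is the sparse-activation property described above.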