Principle: Fastai Fastbook Activation Functions
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Neural Network Architecture, Approximation Theory |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
An activation function is a non-linear function applied element-wise between linear layers in a neural network, enabling the network to learn non-linear relationships that a single linear transformation cannot represent.
Description
A linear layer computes y = x @ W + b. Stacking multiple linear layers without non-linearity in between is mathematically equivalent to a single linear layer (since the composition of linear functions is linear). The activation function breaks this linearity, allowing each layer to learn a distinct transformation.
The Rectified Linear Unit (ReLU) is the most widely used activation function in modern deep learning. Defined as relu(x) = max(x, 0), it simply replaces negative values with zero. Despite its simplicity, ReLU combined with linear layers satisfies the Universal Approximation Theorem: a network with at least one hidden layer and a non-linear activation can approximate any continuous function on a compact set to arbitrary accuracy, given sufficient hidden units.
Usage
Use non-linear activation functions whenever:
- You are building a neural network with more than one layer and need it to learn non-linear patterns.
- You want to move beyond a simple linear classifier to a model capable of capturing complex feature interactions.
- You need to choose between activation functions: ReLU is the default choice for hidden layers due to its computational efficiency and good gradient properties.
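As a minimal sketch, ReLU can be implemented directly from its definition with `np.maximum` (the function name `relu` here is illustrative, not from a specific library):

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0): negative entries become exactly zero,
    # positive entries pass through unchanged.
    return np.maximum(x, 0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # negatives and zero map to 0.0; 1.5 and 3.0 are unchanged
```

In practice you would use a framework's built-in version (e.g. `F.relu` in PyTorch), but the operation is exactly this one-liner.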
Theoretical Basis
Why Non-linearity Is Required
Consider two successive linear layers:
```
y1 = x @ W1 + b1
y2 = y1 @ W2 + b2
   = (x @ W1 + b1) @ W2 + b2
   = x @ (W1 @ W2) + (b1 @ W2 + b2)
   = x @ W_combined + b_combined
```
This shows that two linear layers collapse into one. No matter how many linear layers are stacked, the result is always a single linear transformation. Non-linearity between layers prevents this collapse.
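The collapse can be verified numerically: with arbitrary weights (random shapes chosen here purely for illustration), the output of two stacked linear layers matches a single layer built from the combined parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))                 # batch of 4 inputs with 5 features
W1, b1 = rng.normal(size=(5, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Two stacked linear layers, no activation in between.
y2 = (x @ W1 + b1) @ W2 + b2

# One linear layer with the algebraically combined parameters.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
y_single = x @ W_combined + b_combined

print(np.allclose(y2, y_single))  # True: the two layers collapsed into one
```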
ReLU Definition
```
relu(x) = max(x, 0) = { x  if x > 0
                      { 0  if x <= 0
```
Properties:
- Computationally efficient: Only a comparison and selection operation.
- Sparse activation: Produces exact zeros for negative inputs, leading to sparse representations.
- Non-vanishing gradient (for positive inputs): The gradient is exactly 1 for positive inputs, avoiding the vanishing gradient problem that plagues sigmoid and tanh in deep networks.
- Derivative:
relu'(x) = 1 if x > 0, else 0.
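The derivative is just as cheap to compute as the function itself; a sketch (the gradient at exactly x = 0 is taken as 0 by convention, which is what the boolean test below produces):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def relu_grad(x):
    # 1.0 where x > 0, else 0.0; at x == 0 the comparison is False, giving 0.
    return (x > 0).astype(float)

x = np.array([-1.0, 0.0, 2.0])
print(relu_grad(x))  # gradient is 0 for the non-positive inputs, 1 for 2.0
```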
The Universal Approximation Theorem
The theorem (Cybenko 1989, Hornik 1991) states that a feedforward network with:
- At least one hidden layer
- A non-linear activation function (such as ReLU)
- Sufficiently many hidden units
can approximate any continuous function f: R^n -> R^m on a compact domain to any desired degree of accuracy. Intuitively, with ReLU, the network creates a piecewise-linear approximation: more hidden units mean shorter line segments, producing a closer fit to any target curve.
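A concrete instance of this piecewise-linear construction: the absolute-value function is represented exactly by a hidden layer of just two ReLU units, since |x| = relu(x) + relu(-x). This is a hand-built illustration, not a trained network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# |x| = relu(x) + relu(-x): one ReLU contributes the right-hand ramp,
# the other the left-hand ramp. More units would allow more segments,
# hence finer approximations of an arbitrary continuous target.
x = np.linspace(-3, 3, 13)
approx = relu(x) + relu(-x)

print(np.allclose(approx, np.abs(x)))  # True: exact piecewise-linear match
```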
Neural Network as Layer Composition
A two-layer neural network with ReLU is:
```
hidden = relu(input @ W1 + b1)  # first linear layer + activation
output = hidden @ W2 + b2       # second linear layer
```
The first layer projects the input into a different-dimensional space (e.g., 784 -> 30) and applies ReLU to create non-linear features; the second layer combines these features into the final prediction.
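The whole two-layer forward pass can be sketched in numpy. The 784 -> 30 shapes follow the example above; the batch size of 64 and the 10 outputs (e.g. digit classes, as in MNIST) are assumptions for illustration, as is the small weight scale:

```python
import numpy as np

rng = np.random.default_rng(42)
batch = rng.normal(size=(64, 784))           # e.g. 64 flattened 28x28 images

# Randomly initialized parameters (small scale, assumed for the sketch).
W1, b1 = rng.normal(size=(784, 30)) * 0.01, np.zeros(30)
W2, b2 = rng.normal(size=(30, 10)) * 0.01, np.zeros(10)

hidden = np.maximum(batch @ W1 + b1, 0)      # first linear layer + ReLU
output = hidden @ W2 + b2                    # second linear layer

print(hidden.shape, output.shape)  # (64, 30) (64, 10)
```

Note that `hidden` contains no negative values, which is the sparse-activation property described above.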