Principle:Scikit learn Scikit learn Neural Networks

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Supervised Learning, Representation Learning
Last Updated	2026-02-08 15:00 GMT

Overview

Neural networks are computational models composed of layers of interconnected nodes (neurons) that learn hierarchical representations of data through iterative optimization.

Description

Neural networks model complex, non-linear relationships by composing simple parameterized functions (neurons) into layers. Each neuron applies a linear transformation followed by a non-linear activation function, and stacking multiple layers enables the network to learn increasingly abstract representations. By the universal approximation theorem, a feedforward network with a sufficiently wide hidden layer and a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy, which reduces the need for manual feature engineering. Multi-Layer Perceptrons (MLPs) are the classical feedforward architecture, while Restricted Boltzmann Machines (RBMs) are generative models that learn a probability distribution over inputs using an undirected graphical model structure.

Usage

Use MLP classifiers and regressors (MLPClassifier, MLPRegressor) for tabular data when non-linear relationships are expected and sufficient training data is available. MLPs are appropriate when tree-based methods underperform or when automatic feature interaction learning is desired. Use RBMs (BernoulliRBM) for unsupervised feature learning, dimensionality reduction, or as building blocks for deep belief networks. Neural networks require careful hyperparameter tuning (number of layers, layer sizes, learning rate, regularization), are sensitive to feature scaling, and are best suited to problems where the dataset is large enough to support the model's capacity.
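As a minimal sketch of the MLP use case described above, the following fits scikit-learn's MLPClassifier on a synthetic two-class dataset. The dataset (make_moons), the layer sizes, and the max_iter value are illustrative assumptions rather than recommendations from the source; inputs are standardized first because MLP training is sensitive to feature scale.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, non-linearly separable data (illustrative assumption).
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize features, then fit a small two-hidden-layer MLP.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(32, 16),  # assumed sizes, tune per problem
        alpha=1e-4,                   # L2 penalty strength
        max_iter=2000,
        random_state=0,
    ),
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

In practice the hyperparameters would be chosen by cross-validation rather than fixed by hand as they are here.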

Theoretical Basis

A Multi-Layer Perceptron (MLP) consists of an input layer, one or more hidden layers, and an output layer. For a network with $L$ hidden layers:

Forward pass:

$a^{(0)} = x$
$z^{(l)} = W^{(l)} a^{(l - 1)} + b^{(l)}, \quad l = 1, \dots, L + 1$
$a^{(l)} = σ (z^{(l)}), \quad l = 1, \dots, L$
$\hat{y} = g (z^{(L + 1)})$

where $σ$ is the hidden layer activation function and $g$ is the output activation (identity for regression, softmax for multiclass classification, logistic sigmoid for binary classification).
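The forward pass can be sketched in NumPy as follows. This is an illustration under stated assumptions, not scikit-learn's implementation: the helper names (relu, softmax, forward), the layer sizes, and the random initialization are all hypothetical, and the column-vector convention $W a + b$ follows the equations above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def forward(x, weights, biases, hidden_activation=relu, output_activation=softmax):
    """Apply a^(l) = sigma(W^(l) a^(l-1) + b^(l)) for hidden layers, then g on the output."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = hidden_activation(W @ a + b)
    z_out = weights[-1] @ a + biases[-1]
    return output_activation(z_out)

# Hypothetical network: 4 inputs, two hidden layers of 8 units, 3 output classes.
rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]
weights = [
    rng.normal(scale=0.5, size=(n_out, n_in))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.normal(size=4)
y_hat = forward(x, weights, biases)
print(y_hat, y_hat.sum())
```

Note that a fitted scikit-learn MLP stores each weight matrix in coefs_ with shape (n_in, n_out), the transpose of the convention used in this sketch.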

Common activation functions:

ReLU: $σ (z) = \max (0, z)$
Sigmoid: $σ (z) = 1 / (1 + e^{- z})$
Tanh: $σ (z) = \tanh (z)$
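A small NumPy sketch of the three activations listed above, evaluated on a few sample points. The dictionary keys mirror the strings accepted by the activation parameter of scikit-learn's MLP estimators ("relu", "logistic", "tanh"); the ACTIVATIONS dictionary itself is a hypothetical helper for illustration.

```python
import numpy as np

# Hypothetical lookup table; keys follow scikit-learn's `activation` option names.
ACTIVATIONS = {
    "relu": lambda z: np.maximum(0.0, z),
    "logistic": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh": np.tanh,
}

z = np.array([-2.0, 0.0, 2.0])
outputs = {name: fn(z) for name, fn in ACTIVATIONS.items()}

for name, values in outputs.items():
    print(name, values)
```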

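Backpropagation computes the gradients of the loss with respect to all weights and biases by applying the chain rule backwards from the output layer. Inside a derivative, $L$ denotes the loss function, not the number of hidden layers:

$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial z^{(l)}} \cdot (a^{(l - 1)})^{T}$

The error term of each hidden layer $l$ is obtained from the layer above it, where $\odot$ is the element-wise product:

$\frac{\partial L}{\partial z^{(l)}} = \left( (W^{(l + 1)})^{T} \frac{\partial L}{\partial z^{(l + 1)}} \right) \odot σ' (z^{(l)})$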
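To make the chain-rule expression concrete, the sketch below computes the weight gradients of a tiny network by hand and compares one entry against a finite-difference estimate. The setup is an assumption chosen for brevity: one tanh hidden layer, an identity output, a squared-error loss, and random data; none of these choices come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # single input sample
y = rng.normal(size=2)            # regression target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def loss_fn(W1_, W2_):
    a1_ = np.tanh(W1_ @ x + b1)
    y_hat_ = W2_ @ a1_ + b2
    return 0.5 * np.sum((y_hat_ - y) ** 2)

# Forward pass, keeping intermediate activations.
a1 = np.tanh(W1 @ x + b1)
y_hat = W2 @ a1 + b2

# Backward pass: dLoss/dz for each layer, then dLoss/dW = delta * a_prev^T.
delta2 = y_hat - y                           # identity output + squared error
grad_W2 = np.outer(delta2, a1)
delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(delta1, x)

# Central finite difference on one weight as a sanity check.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, W2) - loss_fn(W1_minus, W2)) / (2 * eps)

print("analytic:", grad_W1[0, 0], "numeric:", numeric)
```

Agreement between the analytic and numeric values is a common way to check a hand-written backward pass.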

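Weights are then updated by a gradient-based optimizer. The basic gradient descent step with learning rate $η$ is:

$W^{(l)} \leftarrow W^{(l)} - η \frac{\partial L}{\partial W^{(l)}}$

SGD applies this step to mini-batch estimates of the gradient, Adam rescales it using running estimates of the gradient's first and second moments, and L-BFGS is a quasi-Newton method that uses an approximation of curvature instead of a fixed learning rate.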
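The update rule can be illustrated on the simplest possible case, a single linear layer trained with full-batch gradient descent on a mean squared-error loss. The data, the learning rate eta = 0.1, and the iteration count are assumptions chosen so the example converges quickly; this is not how scikit-learn's solvers are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth weights
y = X @ true_w

w = np.zeros(3)
eta = 0.1                             # learning rate (assumed value)
losses = []
for _ in range(200):
    residual = X @ w - y
    losses.append(0.5 * np.mean(residual ** 2))
    grad = X.T @ residual / len(y)    # gradient of the loss w.r.t. w
    w = w - eta * grad                # w <- w - eta * dLoss/dw

print("learned weights:", w)
print("first loss:", losses[0], "last loss:", losses[-1])
```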

Regularization techniques prevent overfitting:

L2 penalty: $α \sum_{l} ‖ W^{(l)} ‖_{F}^{2}$ added to the loss
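Early stopping: A portion of the training data is held out as a validation set, and training halts when the validation score stops improving for a set number of consecutive epochs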
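Both techniques map onto constructor parameters of scikit-learn's MLP estimators: alpha sets the L2 penalty strength, and early_stopping=True holds out validation_fraction of the training data and stops once the validation score has not improved for n_iter_no_change epochs. The sketch below uses a synthetic dataset, and the specific values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data (illustrative assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, random_state=0
)

mlp = MLPClassifier(
    hidden_layer_sizes=(64,),
    alpha=1e-3,               # L2 penalty strength
    early_stopping=True,      # hold out part of the training set for validation
    validation_fraction=0.1,
    n_iter_no_change=10,      # patience, in epochs
    max_iter=500,
    random_state=0,
)
mlp.fit(X, y)

epochs_run = mlp.n_iter_
best_val = mlp.best_validation_score_
print(f"stopped after {epochs_run} epochs, best validation score {best_val:.3f}")
```

Early stopping in these estimators applies to the stochastic solvers ("sgd" and "adam"); the default solver is "adam".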

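A Restricted Boltzmann Machine (RBM) is an undirected graphical model with binary visible units $v$ and binary hidden units $h$, in which connections exist only between the visible and hidden layers. With visible biases $b$, hidden biases $c$, and weight matrix $W$, the energy function is:

$E (v, h) = - b^{T} v - c^{T} h - v^{T} W h$

The joint probability is $p (v, h) = \frac{1}{Z} \exp (- E (v, h))$, where $Z = \sum_{v, h} \exp (- E (v, h))$ is the partition function. The conditional distributions are: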

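$p (h_{j} = 1 | v) = σ (c_{j} + \sum_{i} W_{i j} v_{i})$
$p (v_{i} = 1 | h) = σ (b_{i} + \sum_{j} W_{i j} h_{j})$

where $σ$ here denotes the logistic sigmoid and $W_{i j}$ couples visible unit $i$ to hidden unit $j$.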
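The two conditionals can be evaluated directly in NumPy. In this sketch W has shape (n_visible, n_hidden) so that it matches the $v^{T} W h$ term of the energy function; the unit counts and random parameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3                    # hypothetical sizes
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)                       # visible biases
c = np.zeros(n_hidden)                        # hidden biases

v = rng.integers(0, 2, size=n_visible).astype(float)

# p(h_j = 1 | v) for every hidden unit, then sample a binary hidden state.
p_h = sigmoid(c + W.T @ v)
h = (rng.random(n_hidden) < p_h).astype(float)

# p(v_i = 1 | h) for every visible unit, given the sampled hidden state.
p_v = sigmoid(b + W @ h)

print("p(h | v):", p_h)
print("p(v | h):", p_v)
```

scikit-learn's BernoulliRBM stores its weights in components_ with shape (n_components, n_features), which is the transpose of the W used here.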

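Training uses Contrastive Divergence (CD-k), which approximates the gradient of the log-likelihood using $k$ steps of Gibbs sampling started from a training example. scikit-learn's BernoulliRBM uses a persistent variant, Stochastic Maximum Likelihood (also called Persistent Contrastive Divergence), in which the Gibbs chains are carried over between updates instead of being restarted at the data.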
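A minimal NumPy sketch of one CD-k parameter update on a mini-batch, assuming binary units and the same (n_visible, n_hidden) weight layout as the energy function above. The function name cd_k_update, the learning rate, and the toy data are hypothetical; this illustrates plain CD-k and is not the persistent variant that scikit-learn implements.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, b, c, k=1, lr=0.1, rng=None):
    """One CD-k step on a batch v0 of shape (batch, n_visible); requires k >= 1."""
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: hidden probabilities given the data.
    p_h0 = sigmoid(c + v0 @ W)
    h = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Negative phase: k alternating Gibbs steps starting from the data.
    for _ in range(k):
        p_v = sigmoid(b + h @ W.T)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(c + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)

    # Gradient estimate: data statistics minus model (chain) statistics.
    n = v0.shape[0]
    W = W + lr * (v0.T @ p_h0 - v.T @ p_h) / n
    b = b + lr * (v0 - v).mean(axis=0)
    c = c + lr * (p_h0 - p_h).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
batch = (rng.random((32, 6)) < 0.3).astype(float)   # toy binary data
W0 = rng.normal(scale=0.01, size=(6, 3))
b0, c0 = np.zeros(6), np.zeros(3)

W1, b1, c1 = cd_k_update(batch, W0, b0, c0, k=1, lr=0.1, rng=rng)
print(W1.shape, b1.shape, c1.shape)
```

Larger k gives a less biased gradient estimate at the cost of more sampling per update.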
