Principle:Scikit learn Scikit learn Semi Supervised Learning

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Supervised Learning, Unsupervised Learning
Last Updated	2026-02-08 15:00 GMT

Overview

Semi-supervised learning uses both labeled and unlabeled data to train models that can outperform models trained on the labeled data alone.

Description

Semi-supervised learning occupies the space between supervised and unsupervised learning, exploiting the structure revealed by unlabeled data to improve classification performance when labels are scarce or expensive to obtain. These methods are grounded in assumptions about the relationship between the input distribution and the labels: the smoothness assumption (nearby points have similar labels), the cluster assumption (data forms clusters and points in the same cluster share a label), and the manifold assumption (data lies on a low-dimensional manifold). Semi-supervised learning addresses the practical problem that obtaining labeled data is often costly while unlabeled data is abundant.

Usage

Use Label Propagation or Label Spreading when the data has a graph structure or when the cluster assumption holds, and you want to propagate labels through a similarity graph. Use Self-Training when you have a good supervised base classifier and want to iteratively expand the training set with high-confidence predictions on unlabeled data. Semi-supervised methods are most beneficial when the number of labeled samples is small relative to the total dataset, and the unlabeled data reveals useful structure (e.g., cluster boundaries or manifold geometry) that aligns with the classification task.
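As a rough illustration of the graph-based case, the sketch below fits scikit-learn's LabelSpreading on a toy two-moons dataset in which only 10 points per class keep their labels. The dataset, the number of labeled points, and the hyperparameter values (n_neighbors, alpha, max_iter) are illustrative assumptions rather than recommendations; scikit-learn marks unlabeled samples with the label -1.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Toy data: two interleaving half-circles, 200 points in total.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# Keep 10 labels per class; every other sample is marked unlabeled with -1.
rng = np.random.RandomState(0)
y_partial = np.full_like(y_true, -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == cls), size=10, replace=False)
    y_partial[keep] = cls

# KNN similarity graph; all hyperparameter values here are illustrative.
model = LabelSpreading(kernel="knn", n_neighbors=10, alpha=0.8, max_iter=200)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every training point.
unlabeled = y_partial == -1
accuracy = (model.transduction_[unlabeled] == y_true[unlabeled]).mean()
print(f"accuracy on originally unlabeled points: {accuracy:.2f}")
```

On data like this, where each class forms its own connected cluster, a handful of labels is usually enough for the graph to carry them to the rest of the cluster. When the clusters do not line up with the classes, the same procedure can spread wrong labels just as readily.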

Theoretical Basis

Label Propagation constructs a graph where nodes are data points (both labeled and unlabeled) and edges are weighted by a similarity function. Labels propagate through the graph:

1. Build an affinity matrix $W$ where $W_{ij} = k(x_i, x_j)$ (e.g., RBF kernel or KNN-based).
2. Construct the transition matrix $T = D^{-1} W$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$.
3. Initialize label distributions $Y^{(0)}$ from known labels.
4. Iterate: $Y^{(t+1)} = T Y^{(t)}$.
5. Clamp labeled points to their true labels after each iteration.
6. Repeat steps 4 and 5 until convergence.
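The steps above can be sketched in a few lines of NumPy. This is a minimal dense implementation for illustration only, not scikit-learn's code: the helper names rbf_affinity and label_propagation, the gamma value, and the stopping rule are assumptions made for this example.

```python
import numpy as np

def rbf_affinity(X, gamma=20.0):
    """Dense RBF affinity: W_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def label_propagation(X, y, n_classes, gamma=20.0, max_iter=1000, tol=1e-6):
    """y holds class indices for labeled points and -1 for unlabeled ones."""
    idx = np.flatnonzero(y != -1)              # positions of labeled points
    W = rbf_affinity(X, gamma)
    T = W / W.sum(axis=1, keepdims=True)       # T = D^{-1} W (row-normalized)

    Y0 = np.zeros((len(X), n_classes))
    Y0[idx, y[idx]] = 1.0                      # one-hot rows for known labels
    Y = Y0.copy()

    for _ in range(max_iter):
        Y_next = T @ Y                         # propagate along the graph
        Y_next[idx] = Y0[idx]                  # hard-clamp labeled points
        converged = np.abs(Y_next - Y).sum() < tol
        Y = Y_next
        if converged:
            break
    return Y.argmax(axis=1)

# Two well-separated 1-D groups with a single labeled point in each.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, -1, -1, -1, -1, 1])
pred = label_propagation(X, y, n_classes=2)
print(pred)
```

Because the affinity between the two groups is negligible compared with the affinity inside each group, every unlabeled point takes the label of the labeled point in its own group.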

Label Spreading is a variant that uses a normalized graph Laplacian and allows soft clamping of labeled points:

$Y^{(t+1)} = \alpha S Y^{(t)} + (1 - \alpha) Y^{(0)}$

where $S = D^{-1/2} W D^{-1/2}$ is the normalized affinity matrix and $\alpha \in (0, 1)$ controls the balance between propagated and initial labels. Because labeled points are only softly clamped, the algorithm can revise their initial labels; this acts as regularization and makes the method more robust to noisy labels.
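A small NumPy sketch of this update follows, run on a hypothetical 4-node chain graph. Setting $Y^{(t+1)} = Y^{(t)}$ in the update gives the fixed point $Y^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y^{(0)}$, and the code compares the iterates against it. The function name label_spreading, the graph, and the value of alpha are assumptions made for this example.

```python
import numpy as np

def label_spreading(W, Y0, alpha=0.8, n_iter=200):
    """Iterate Y <- alpha * S @ Y + (1 - alpha) * Y0 for a symmetric affinity W."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))            # S = D^{-1/2} W D^{-1/2}
    Y = Y0.copy()
    for _ in range(n_iter):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0  # soft clamping toward Y0
    return Y, S

# Hypothetical chain graph 0 - 1 - 2 - 3; only the two end nodes are labeled.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y0 = np.array([[1, 0],
               [0, 0],
               [0, 0],
               [0, 1]], dtype=float)

alpha = 0.8
Y_iter, S = label_spreading(W, Y0, alpha=alpha)

# Fixed point of the update, obtained by solving (I - alpha * S) Y = (1 - alpha) Y0.
Y_closed = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * S, Y0)
labels = Y_iter.argmax(axis=1)
print(labels, np.allclose(Y_iter, Y_closed))
```

Each interior node ends up with the label of the nearer end node. Note that the rows of the result are scores rather than normalized probabilities.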

The label propagation objective can be viewed as minimizing:

$\sum_{i,j} W_{ij} \lVert y_i - y_j \rVert^2$

subject to constraints on labeled points, which encourages similar points (high $W_{ij}$) to have similar labels.
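For a symmetric $W$, this sum equals $2\,\mathrm{tr}(Y^{\top} (D - W) Y)$, where $D - W$ is the unnormalized graph Laplacian. This is why the objective is often described as a Laplacian smoothness penalty. The check below evaluates both forms on random data; the matrices are arbitrary and serve only to illustrate the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary symmetric affinity matrix (zero diagonal) and soft label matrix.
A = rng.random((5, 5))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
Y = rng.random((5, 3))

# Direct evaluation of sum_{i,j} W_ij * ||y_i - y_j||^2.
diffs = Y[:, None, :] - Y[None, :, :]
pairwise = (W * (diffs ** 2).sum(axis=-1)).sum()

# Same quantity written with the unnormalized graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W
quadratic = 2.0 * np.trace(Y.T @ L @ Y)

print(np.isclose(pairwise, quadratic))
```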

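Self-Training wraps a supervised classifier in an iterative pseudo-labeling loop:

1. Train a supervised classifier on the labeled data.
2. Use the classifier to predict labels for the unlabeled data.
3. Add the most confident predictions (those whose predicted probability exceeds a threshold $\tau$) to the labeled set as pseudo-labels.
4. Retrain the classifier on the expanded labeled set.
5. Repeat steps 2 to 4 until no new pseudo-labels are added or a maximum number of iterations is reached.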

Self-training is agnostic to the choice of base classifier and can be applied with any model that produces probability estimates. The key risk is error propagation: incorrect pseudo-labels reinforce themselves in subsequent iterations.
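A minimal sketch of this loop using scikit-learn's SelfTrainingClassifier is shown below. The synthetic dataset, the logistic-regression base model, and the threshold value are illustrative assumptions. The base estimator is passed positionally because the keyword name for that argument has differed between scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic binary classification problem (illustrative only).
X, y_true = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Keep 15 labels per class; all other samples are marked unlabeled with -1.
rng = np.random.RandomState(0)
y_partial = np.full_like(y_true, -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == cls), size=15, replace=False)
    y_partial[keep] = cls

# Any classifier exposing predict_proba can serve as the base model.
base = LogisticRegression(max_iter=1000)
self_training = SelfTrainingClassifier(base, threshold=0.9, max_iter=10)
self_training.fit(X, y_partial)

# transduction_ keeps -1 for samples that never received a pseudo-label.
n_initial = int((y_partial != -1).sum())
n_final = int((self_training.transduction_ != -1).sum())
print(f"labeled before: {n_initial}, labeled after: {n_final}")
print("stopped because:", self_training.termination_condition_)
```

Raising the threshold admits fewer but more reliable pseudo-labels, which is one simple way to limit the error propagation described above, at the cost of using less of the unlabeled data.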
