Implementation:Scikit learn Scikit learn SelfTrainingClassifier

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Machine Learning, Semi-Supervised Learning
Last Updated	2026-02-08 15:00 GMT

Overview

Concrete tool for semi-supervised self-training classification provided by scikit-learn.

Description

This module implements SelfTrainingClassifier, a meta-estimator that wraps a supervised classifier so that it can also learn from unlabeled data. On each iteration it predicts pseudo-labels for the unlabeled samples and adds the confident predictions to the training set. Two selection criteria are supported: threshold (predictions whose probability exceeds threshold are added) and k_best (the k_best most confident predictions are added per iteration). Iteration stops when max_iter is reached, when no new pseudo-labels are added, or when every sample has been labeled.
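The loop described above can be sketched by hand. This is a simplified illustration of threshold-based self-training, not the library's actual implementation; the helper name self_train and the choice of LogisticRegression as base classifier are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def self_train(base, X, y, threshold=0.75, max_iter=10):
    """Hypothetical helper: hand-rolled threshold-based self-training."""
    y = np.copy(y)
    has_label = y != -1
    for _ in range(max_iter):
        if has_label.all():
            break  # nothing left to pseudo-label
        # Fit on everything labeled so far, then score the unlabeled rest.
        base.fit(X[has_label], y[has_label])
        proba = base.predict_proba(X[~has_label])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break  # no prediction cleared the threshold
        # Map positions in the unlabeled subset back to full-array indices.
        idx = np.flatnonzero(~has_label)[confident]
        y[idx] = base.classes_[proba.argmax(axis=1)[confident]]
        has_label[idx] = True
    # Final fit on original labels plus accepted pseudo-labels.
    base.fit(X[has_label], y[has_label])
    return base, y


X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y = np.copy(y_true)
y[rng.rand(len(y)) < 0.7] = -1  # hide roughly 70% of the labels

model, y_filled = self_train(LogisticRegression(max_iter=1000), X, y)
print("Labeled before:", int((y != -1).sum()))
print("Labeled after:", int((y_filled != -1).sum()))
```

The original labels are never overwritten; only entries that started as -1 can receive a pseudo-label.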

Usage

Use SelfTrainingClassifier when you have a supervised classifier that supports predict_proba and you want to leverage unlabeled data to improve classification performance. Mark unlabeled samples with -1 in the target array.

Code Reference

Source Location

Repository: scikit-learn
File: sklearn/semi_supervised/_self_training.py

Signature

class SelfTrainingClassifier(ClassifierMixin, MetaEstimatorMixin, BaseEstimator):
    """Self-training classifier."""

    def __init__(
        self,
        estimator,
        threshold=0.75,
        criterion="threshold",
        k_best=10,
        max_iter=10,
        verbose=False,
    ):
        ...

Import

from sklearn.semi_supervised import SelfTrainingClassifier

I/O Contract

Inputs

Name	Type	Required	Description
estimator	estimator object	Yes	Supervised classifier with fit and predict_proba methods
X	array-like of shape (n_samples, n_features)	Yes	Training data (labeled and unlabeled)
y	array-like of shape (n_samples,)	Yes	Target labels; -1 indicates unlabeled samples
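threshold	float	No	Probability threshold for adding pseudo-labels; only used when criterion='threshold' (default: 0.75)
criterion	str	No	Selection criterion: 'threshold' or 'k_best' (default: 'threshold')
k_best	int	No	Number of most confident samples to add per iteration; only used when criterion='k_best' (default: 10)
max_iter	int or None	No	Maximum number of self-training iterations; None runs until no new pseudo-labels are added or all samples are labeled (default: 10)
verbose	bool	No	Enable verbose output (default: False)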
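As a minimal sketch of the k_best criterion from the table above, the example below caps each iteration at a fixed number of new pseudo-labels instead of using a probability threshold. The dataset, the LogisticRegression base classifier, and the specific values k_best=5 and max_iter=3 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y = np.copy(y_true)
y[rng.rand(len(y)) < 0.7] = -1  # -1 marks unlabeled samples

# Add at most 5 pseudo-labels per iteration, for at most 3 iterations.
clf = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000),
    criterion="k_best",
    k_best=5,
    max_iter=3,
)
clf.fit(X, y)

# labeled_iter_ > 0 selects samples that were pseudo-labeled during fit.
newly_labeled = int((clf.labeled_iter_ > 0).sum())
print("Pseudo-labeled samples:", newly_labeled)
print("Iterations run:", clf.n_iter_)
```

With k_best the number of pseudo-labels grows at a predictable rate, which can be easier to reason about than a threshold when the base classifier's probabilities are poorly calibrated.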

Outputs

Name	Type	Description
estimator_	estimator	Fitted base estimator on labeled + pseudo-labeled data
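labeled_iter_	ndarray of shape (n_samples,)	Iteration in which each sample was labeled (0 for samples labeled in the original data, -1 for samples never labeled)
transduction_	ndarray of shape (n_samples,)	Labels used for the final fit: original labels plus pseudo-labels added during fit (-1 where a sample remained unlabeled)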
classes_	ndarray	Unique class labels
n_iter_	int	Number of self-training iterations
termination_condition_	str	Reason for stopping: 'max_iter', 'no_change', or 'all_labeled'
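The fitted attributes listed above can be inspected after fit. The following is a small sketch, assuming the iris dataset, a LogisticRegression base classifier, and a threshold of 0.9 picked only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y = np.copy(y_true)
y[rng.rand(len(y)) < 0.7] = -1  # -1 marks unlabeled samples

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y)

# labeled_iter_: 0 = labeled in the input, >0 = pseudo-labeled, -1 = never labeled.
originally_labeled = clf.labeled_iter_ == 0
never_labeled = clf.labeled_iter_ == -1

print("Stopped because:", clf.termination_condition_)
print("Rounds:", clf.n_iter_)
print("Pseudo-labeled:", int((clf.labeled_iter_ > 0).sum()))
print("Still unlabeled:", int(never_labeled.sum()))
print("Classes:", clf.classes_)
```

Comparing transduction_ against held-back true labels for the pseudo-labeled samples is one way to check how many pseudo-labels were wrong before trusting the final model.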

Usage Examples

Basic Usage

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
rng = np.random.RandomState(42)

# Mark 70% of samples as unlabeled
labels = np.copy(iris.target)
mask = rng.rand(len(labels)) < 0.7
labels[mask] = -1

# Wrap SVC in SelfTrainingClassifier
svc = SVC(probability=True, gamma='auto')
self_training = SelfTrainingClassifier(svc, threshold=0.75, max_iter=10)
self_training.fit(iris.data, labels)
print("Accuracy:", self_training.score(iris.data, iris.target))
print("Iterations:", self_training.n_iter_)

Related Pages

Principle:Scikit_learn_Scikit_learn_Semi_Supervised_Learning
