

Implementation:Scikit learn Scikit learn SelfTrainingClassifier

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Semi-Supervised Learning
Last Updated 2026-02-08 15:00 GMT

Overview

Concrete tool for semi-supervised self-training classification provided by scikit-learn.

Description

This module implements SelfTrainingClassifier, a meta-estimator that wraps a supervised classifier so that it can be trained as a semi-supervised classifier. In each iteration it fits the wrapped classifier on the currently labeled samples, predicts pseudo-labels for the unlabeled samples, and adds the most confident predictions to the training set. Two selection criteria are supported: threshold (pseudo-labels whose predicted probability exceeds threshold are added) and k_best (the k_best most confident predictions are added in each iteration). Iteration stops when max_iter is reached, when no new pseudo-labels are added, or when every sample has been labeled.
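The loop described above can be sketched by hand for the threshold criterion. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper name self_train_sketch is hypothetical, and LogisticRegression on the iris data is an assumed stand-in for the wrapped classifier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def self_train_sketch(base, X, y, threshold=0.75, max_iter=10):
    """Hypothetical helper: simplified threshold-based self-training."""
    y = np.array(y, copy=True)
    for _ in range(max_iter):
        unlabeled = y == -1
        if not unlabeled.any():
            break  # every sample already has a label
        # Fit on the samples that currently carry a label.
        base.fit(X[~unlabeled], y[~unlabeled])
        proba = base.predict_proba(X[unlabeled])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break  # no new pseudo-labels in this round
        # Promote confident predictions to pseudo-labels.
        idx = np.flatnonzero(unlabeled)[confident]
        y[idx] = base.classes_[proba.argmax(axis=1)[confident]]
    # Final fit on original labels plus pseudo-labels.
    labeled = y != -1
    base.fit(X[labeled], y[labeled])
    return base, y


X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_semi = y_true.copy()
y_semi[rng.rand(len(y_semi)) < 0.7] = -1  # hide roughly 70% of the labels

model, y_filled = self_train_sketch(LogisticRegression(max_iter=1000), X, y_semi)
print("Labeled before:", int((y_semi != -1).sum()))
print("Labeled after:", int((y_filled != -1).sum()))
```

The sketch never overwrites an original label; it only fills in entries that were -1, which mirrors the behavior the description attributes to the meta-estimator.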

Usage

Use SelfTrainingClassifier when you have a supervised classifier that supports predict_proba and you want to leverage unlabeled data to improve classification performance. Mark unlabeled samples with -1 in the target array.
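As a minimal sketch of the two requirements above, the snippet below marks some entries of a toy target array as unlabeled and checks whether candidate classifiers expose predict_proba. The toy labels and the choice of LogisticRegression and LinearSVC are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy target array; indices 1, 2 and 4 are treated as unlabeled.
y_true = np.array([0, 1, 0, 1, 1, 0])
unlabeled_idx = [1, 2, 4]
y_semi = y_true.copy()
y_semi[unlabeled_idx] = -1  # -1 is the marker for "no label"
print(y_semi)

# A classifier without predict_proba is not a suitable base estimator.
for clf in (LogisticRegression(), LinearSVC()):
    print(type(clf).__name__, hasattr(clf, "predict_proba"))
```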

Code Reference

Source Location

Signature

class SelfTrainingClassifier(ClassifierMixin, MetaEstimatorMixin, BaseEstimator):
    """Self-training classifier."""

    def __init__(
        self,
        estimator,
        threshold=0.75,
        criterion="threshold",
        k_best=10,
        max_iter=10,
        verbose=False,
    ):
        ...

Import

from sklearn.semi_supervised import SelfTrainingClassifier

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| estimator | estimator object | Yes | Supervised classifier with fit and predict_proba methods |
| X | array-like of shape (n_samples, n_features) | Yes | Training data (labeled and unlabeled) |
| y | array-like of shape (n_samples,) | Yes | Target labels; -1 indicates unlabeled samples |
| threshold | float | No | Probability threshold for adding pseudo-labels; used with criterion='threshold' (default: 0.75) |
| criterion | str | No | Selection criterion: 'threshold' or 'k_best' (default: 'threshold') |
| k_best | int | No | Number of most confident samples added per iteration; used with criterion='k_best' (default: 10) |
| max_iter | int or None | No | Maximum number of self-training iterations; None removes the iteration limit (default: 10) |
| verbose | bool | No | Enable verbose output (default: False) |

Outputs

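| Name | Type | Description |
|------|------|-------------|
| estimator_ | estimator | Base estimator fitted on labeled and pseudo-labeled data |
| labeled_iter_ | ndarray of shape (n_samples,) | Iteration in which each sample was labeled: 0 for samples labeled in the original data, -1 for samples that were never labeled |
| transduction_ | ndarray of shape (n_samples,) | Labels used for the final fit, including pseudo-labels added during fit; samples that were never labeled remain -1 |
| classes_ | ndarray | Unique class labels |
| n_iter_ | int | Number of self-training rounds performed |
| termination_condition_ | str | Reason for stopping: 'max_iter', 'no_change', or 'all_labeled' |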
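The fitted attributes can be inspected after fit to see how far self-training progressed. The following is a sketch rather than a canonical recipe: the iris data, the random 70% masking, the LogisticRegression base estimator, and threshold=0.9 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_semi = y_true.copy()
y_semi[rng.rand(len(y_semi)) < 0.7] = -1  # hide roughly 70% of the labels

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_semi)

# labeled_iter_ separates the three groups of samples.
originally_labeled = y_semi != -1
pseudo_labeled = clf.labeled_iter_ > 0
never_labeled = clf.labeled_iter_ == -1

print("Stopped because:", clf.termination_condition_)
print("Self-training rounds:", clf.n_iter_)
print("Originally labeled:", int(originally_labeled.sum()))
print("Pseudo-labeled:", int(pseudo_labeled.sum()))
print("Never labeled:", int(never_labeled.sum()))
```

Comparing transduction_ against the held-back true labels for the pseudo-labeled subset is one way to estimate how reliable the added labels were.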

Usage Examples

Basic Usage

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
rng = np.random.RandomState(42)

# Mark 70% of samples as unlabeled
labels = np.copy(iris.target)
mask = rng.rand(len(labels)) < 0.7
labels[mask] = -1

# Wrap SVC in SelfTrainingClassifier
svc = SVC(probability=True, gamma='auto')
self_training = SelfTrainingClassifier(svc, threshold=0.75, max_iter=10)
self_training.fit(iris.data, labels)
print("Accuracy:", self_training.score(iris.data, iris.target))
print("Iterations:", self_training.n_iter_)
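The Basic Usage example relies on the default threshold criterion. For comparison, here is a minimal sketch of the k_best criterion, which adds a fixed number of the most confident predictions per iteration regardless of their absolute probability. The LogisticRegression base estimator and the values k_best=10 and max_iter=3 are assumptions chosen to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_semi = y_true.copy()
y_semi[rng.rand(len(y_semi)) < 0.7] = -1  # hide roughly 70% of the labels

clf = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000),
    criterion="k_best",
    k_best=10,   # at most 10 new pseudo-labels per iteration
    max_iter=3,  # stop after three rounds
)
clf.fit(X, y_semi)

# Count how many samples were pseudo-labeled in each round (rounds start at 1).
added_per_round = np.bincount(clf.labeled_iter_[clf.labeled_iter_ > 0])[1:]
print("Pseudo-labels added per round:", added_per_round)
print("Stopped because:", clf.termination_condition_)
```

Because k_best ignores the absolute confidence level, it can admit low-confidence pseudo-labels once the easy samples are exhausted, so a small max_iter is a reasonable safeguard.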
