Implementation:Scikit learn Scikit learn SelfTrainingClassifier
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Semi-Supervised Learning |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool for semi-supervised self-training classification provided by scikit-learn.
Description
This module implements SelfTrainingClassifier, a meta-estimator that wraps a supervised classifier so that it can be trained as a semi-supervised classifier. In each iteration it fits the wrapped classifier on the currently labeled samples, predicts pseudo-labels for the unlabeled samples, and adds the most confident predictions to the training set. Two selection criteria are supported: 'threshold' (predictions whose probability exceeds threshold are added) and 'k_best' (the k_best most confident predictions are added in each iteration). Iteration stops when max_iter is reached, when an iteration adds no new pseudo-labels, or when every sample has been labeled.
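The loop described above can be sketched in a few lines. This is a simplified illustration of the threshold criterion only, not scikit-learn's actual implementation; the function name self_train_sketch, the choice of LogisticRegression, and the synthetic dataset are assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def self_train_sketch(estimator, X, y, threshold=0.75, max_iter=10):
    """Simplified threshold-based self-training loop (illustrative only).

    Assumes y marks unlabeled samples with -1 and that every class has
    at least one labeled sample.
    """
    y_work = np.array(y, copy=True)
    has_label = y_work != -1
    model = clone(estimator)
    n_iter = 0

    while not has_label.all() and n_iter < max_iter:
        n_iter += 1
        # Fit on everything that currently carries a (pseudo-)label.
        model.fit(X[has_label], y_work[has_label])

        # Score the samples that are still unlabeled.
        unlabeled_idx = np.flatnonzero(~has_label)
        proba = model.predict_proba(X[unlabeled_idx])
        confidence = proba.max(axis=1)
        pseudo = model.classes_[proba.argmax(axis=1)]

        # Keep only predictions above the probability threshold.
        selected = confidence > threshold
        if not selected.any():
            break  # no new pseudo-labels: stop early
        y_work[unlabeled_idx[selected]] = pseudo[selected]
        has_label[unlabeled_idx[selected]] = True

    # Final fit on labeled plus pseudo-labeled data.
    model.fit(X[has_label], y_work[has_label])
    return model, y_work, n_iter


# Hypothetical data: hide roughly 70% of the labels.
X, y_true = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.RandomState(0)
y_partial = y_true.copy()
y_partial[rng.rand(200) < 0.7] = -1

model, y_filled, n_iter = self_train_sketch(
    LogisticRegression(max_iter=1000), X, y_partial
)
print("Labeled before:", int((y_partial != -1).sum()))
print("Labeled after:", int((y_filled != -1).sum()))
print("Iterations:", n_iter)
```

Original labels are never overwritten in this sketch; only entries that start as -1 can receive a pseudo-label.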
Usage
Use SelfTrainingClassifier when you have a supervised classifier that supports predict_proba and you want to leverage unlabeled data to improve classification performance. Mark unlabeled samples with -1 in the target array.
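As a small illustration of these two requirements, the snippet below builds a target array with -1 markers and checks which candidate classifiers expose predict_proba. The label values and the indices in unlabeled_idx are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical ground-truth labels for eight samples.
y_full = np.array([0, 1, 1, 0, 2, 2, 1, 0])

# Hypothetical indices of samples whose label is unknown.
unlabeled_idx = [1, 4, 7]
y_semi = y_full.copy()
y_semi[unlabeled_idx] = -1  # -1 marks a sample as unlabeled
n_unlabeled = int((y_semi == -1).sum())
print(y_semi, n_unlabeled)

# The wrapped classifier must expose predict_proba.
candidates = {
    "LogisticRegression": LogisticRegression(),
    "SVC(probability=True)": SVC(probability=True),
    "SVC()": SVC(),  # probability estimates are off by default
}
supports_proba = {
    name: hasattr(est, "predict_proba") for name, est in candidates.items()
}
print(supports_proba)
```

SVC only exposes predict_proba when it is constructed with probability=True, which is why the basic usage example on this page passes that flag.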
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/semi_supervised/_self_training.py
Signature
class SelfTrainingClassifier(ClassifierMixin, MetaEstimatorMixin, BaseEstimator):
    """Self-training classifier."""

    def __init__(
        self,
        estimator,
        threshold=0.75,
        criterion="threshold",
        k_best=10,
        max_iter=10,
        verbose=False,
    ):
        ...
Import
from sklearn.semi_supervised import SelfTrainingClassifier
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| estimator | estimator object | Yes | Supervised classifier with fit and predict_proba methods |
| X | array-like of shape (n_samples, n_features) | Yes | Training data, labeled and unlabeled (passed to fit) |
| y | array-like of shape (n_samples,) | Yes | Target labels (passed to fit); -1 indicates unlabeled samples |
| threshold | float | No | Probability threshold for adding pseudo-labels when criterion='threshold'; must be in [0, 1) (default: 0.75) |
| criterion | str | No | Selection criterion: 'threshold' or 'k_best' (default: 'threshold') |
| k_best | int | No | Number of most confident samples added per iteration when criterion='k_best' (default: 10) |
| max_iter | int or None | No | Maximum number of self-training iterations; None runs until no new pseudo-labels are added or all samples are labeled (default: 10) |
| verbose | bool | No | Enable verbose output (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| estimator_ | estimator | Fitted base estimator on labeled + pseudo-labeled data |
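| labeled_iter_ | ndarray of shape (n_samples,) | Iteration in which each sample was labeled: 0 for samples labeled in the original data, -1 for samples that were never labeled |
| transduction_ | ndarray of shape (n_samples,) | Labels used for the final fit, including pseudo-labels added during fit; samples that were never labeled keep -1 |
| classes_ | ndarray of shape (n_classes,) | Unique class labels |
| n_iter_ | int | Number of self-training rounds performed |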
| termination_condition_ | str | Reason for stopping: 'max_iter', 'no_change', or 'all_labeled' |
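The fitted attributes can be inspected as in the sketch below. The dataset, the LogisticRegression base estimator, the random seed, and the threshold of 0.9 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Hide roughly 70% of the labels.
y_semi = y.copy()
y_semi[rng.rand(len(y)) < 0.7] = -1

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_semi)

# labeled_iter_: 0 = labeled in the input, >0 = pseudo-labeled, -1 = never labeled.
originally_labeled = clf.labeled_iter_ == 0
pseudo_labeled = clf.labeled_iter_ > 0
never_labeled = clf.labeled_iter_ == -1

print("Originally labeled:", int(originally_labeled.sum()))
print("Pseudo-labeled:", int(pseudo_labeled.sum()))
print("Never labeled:", int(never_labeled.sum()))
print("Rounds:", clf.n_iter_)
print("Stopped because:", clf.termination_condition_)
print("Classes:", clf.classes_)
```

Comparing transduction_ against held-out ground truth for the pseudo_labeled mask is one way to estimate how reliable the pseudo-labels were.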
Usage Examples
Basic Usage
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
rng = np.random.RandomState(42)

# Mark roughly 70% of samples as unlabeled (-1)
labels = np.copy(iris.target)
mask = rng.rand(len(labels)) < 0.7
labels[mask] = -1

# Wrap SVC in SelfTrainingClassifier; probability=True enables predict_proba
svc = SVC(probability=True, gamma='auto')
self_training = SelfTrainingClassifier(svc, threshold=0.75, max_iter=10)
self_training.fit(iris.data, labels)

# Score against the full ground truth, including samples seen during fit
print("Accuracy:", self_training.score(iris.data, iris.target))
print("Iterations:", self_training.n_iter_)
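The k_best criterion can be used in the same way. The following is a minimal sketch assuming a LogisticRegression base estimator, k_best=20, and max_iter=3; these values are illustrative, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Hide roughly 70% of the labels.
y_semi = y.copy()
y_semi[rng.rand(len(y)) < 0.7] = -1

# Add at most the 20 most confident predictions per iteration, for up to 3 iterations.
k_best_clf = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000),
    criterion="k_best",
    k_best=20,
    max_iter=3,
)
k_best_clf.fit(X, y_semi)

n_pseudo = int((k_best_clf.labeled_iter_ > 0).sum())
print("Pseudo-labeled samples:", n_pseudo)
print("Iterations:", k_best_clf.n_iter_)
print("Stopped because:", k_best_clf.termination_condition_)
```

With k_best the number of pseudo-labels added per round is bounded regardless of how confident the classifier is, whereas the threshold criterion can add many samples in one round or none at all.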