Implementation:Online ml River FeatureSelection SelectKBest
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Feature_Selection, Supervised_Learning |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Selects the k features with the highest scores, where each score is a similarity metric between a feature and the target, computed incrementally.
Description
SelectKBest maintains a running similarity statistic (such as Pearson correlation) between each feature and the target, and updates it incrementally as each new sample arrives. A leaderboard holds the current score of every feature, and during transformation only the k top-ranked features are retained. The use_abs parameter ranks features by absolute similarity, which is useful when negative correlations are as informative as positive ones.
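The incremental update described above can be sketched in plain Python. This is a minimal illustration rather than River's own code: `RunningPearson` is a hypothetical helper, and the Welford-style update it uses is an assumption about how such a statistic can be maintained in constant memory per feature.

```python
import math


class RunningPearson:
    """Pearson correlation updated one (x, y) pair at a time (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c_xy = 0.0  # running sum of co-deviations of x and y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x  # deviation from the previous mean of x
        dy = y - self.mean_y  # deviation from the previous mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each term pairs an "old mean" deviation with a "new mean" deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def get(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        # Undefined correlation (fewer than two points or zero variance) is reported as 0.0.
        return self.c_xy / denom if denom > 0 else 0.0


# One statistic per feature; no past samples are stored.
positive = RunningPearson()
negative = RunningPearson()
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    positive.update(x, y)
    negative.update(x, -y)
```

Because only a handful of running sums are kept per feature, the memory cost grows with the number of features and not with the number of samples seen.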
Usage
Use this for supervised feature selection when you have many features and want to retain only the most predictive ones. It helps reduce dimensionality, limit overfitting, and improve model interpretability. The similarity must be a stats.base.Bivariate statistic; Pearson correlation (stats.PearsonCorr) is the usual choice for numeric targets, while mutual information is a common criterion for classification in general. Because only one running statistic is kept per feature, the selector suits high-dimensional online learning where memory is constrained.
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/feature_selection/k_best.py
Signature
class SelectKBest(base.SupervisedTransformer):
    def __init__(self, similarity: stats.base.Bivariate, k=10, use_abs: bool = False)
Import
from river import feature_selection
from river import stats
I/O Contract
| Input | Output |
|---|---|
| Dict[str, float] - All features | Dict[str, float] - Top k features |
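The contract in the table above (all features in, top k features out) can be sketched as a small filtering function. This is an illustrative sketch under stated assumptions, not the library's code: `select_top_k` is a hypothetical helper, the leaderboard is assumed to be a plain mapping from feature name to score, and passing every feature through when no scores exist yet is also an assumption.

```python
from collections import Counter


def select_top_k(x, leaderboard, k, use_abs=False):
    """Keep the k entries of x whose leaderboard score ranks highest (hypothetical helper)."""
    if not leaderboard:
        # Assumption: with no scores yet, every feature passes through unchanged.
        return dict(x)
    if use_abs:
        # Rank by magnitude so strongly negative scores compete with positive ones.
        scores = Counter({f: abs(s) for f, s in leaderboard.items()})
    else:
        scores = Counter(leaderboard)
    best = {feature for feature, _ in scores.most_common(k)}
    return {f: v for f, v in x.items() if f in best}


leaderboard = {"a": 0.6, "b": -0.9, "c": 0.1}
x = {"a": 1.0, "b": 2.0, "c": 3.0}

raw_pick = select_top_k(x, leaderboard, k=1)                # {'a': 1.0}
abs_pick = select_top_k(x, leaderboard, k=1, use_abs=True)  # {'b': 2.0}
```

With raw scores the strongly negative feature `b` ranks last; ranking by absolute value moves it to the top.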
Usage Examples
from pprint import pprint
from river import feature_selection
from river import stats
from river import stream
from sklearn import datasets
X, y = datasets.make_regression(
    n_samples=100,
    n_features=10,
    n_informative=2,
    random_state=42
)
selector = feature_selection.SelectKBest(
    similarity=stats.PearsonCorr(),
    k=2
)
for xi, yi in stream.iter_array(X, y):
    selector.learn_one(xi, yi)
pprint(selector.leaderboard)
# Counter({9: 0.7898,
# 7: 0.5444,
# 8: 0.1062,
# ...})
selector.transform_one(xi)
# {7: -1.2795, 9: -1.8408}
# Using use_abs parameter for negative correlations
import random
random.seed(42)
X_abs = [[random.random() for _ in range(3)] for _ in range(100)]
y_abs = [0.6 * x[0] - 0.9 * x[1] + 0.1 * x[2] + random.gauss(0, 0.1) for x in X_abs]
selector_with_abs = feature_selection.SelectKBest(
    stats.PearsonCorr(),
    k=1,
    use_abs=True
)
for xi, yi in stream.iter_array(X_abs, y_abs):
    selector_with_abs.learn_one(xi, yi)
selector_with_abs.transform_one({i: v for i, v in enumerate(X_abs[-1])})
# {1: 0.07524386007376704} # Selected feature 1 due to high absolute correlation