Principle:Scikit learn Scikit learn Feature Selection

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Feature Engineering, Model Selection
Last Updated	2026-02-08 15:00 GMT

Overview

Feature selection identifies and retains the most relevant features from a dataset while discarding redundant or irrelevant ones, improving model performance and interpretability.

Description

Feature selection reduces the dimensionality of the input space by keeping a subset of the original features rather than transforming them into new ones (as feature extraction methods such as PCA do). It can reduce overfitting, lower computational cost, and improve model interpretability. Feature selection methods fall into three categories: filter methods, which score features independently of the model; wrapper methods, which evaluate feature subsets using a specific model's performance; and embedded methods, which perform selection as part of model training. Feature selection is an important part of the feature engineering pipeline, especially for high-dimensional datasets.

Usage

Use filter methods (SelectKBest, VarianceThreshold) for fast, model-agnostic feature screening as a preprocessing step. Use wrapper methods (RFE, SequentialFeatureSelector) when you want to optimize feature subsets specifically for a given estimator and can afford the additional computational cost. Use embedded methods (SelectFromModel with L1-regularized models or tree-based feature importances) when feature selection should be integrated with model training. Use mutual information-based scoring when features have non-linear relationships with the target. Use variance threshold as a simple baseline to remove constant or near-constant features.
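As a minimal sketch of using a filter method as a preprocessing step, the example below places SelectKBest inside a scikit-learn Pipeline so that the selector is refit on each training fold during cross-validation. The dataset (iris), the downstream classifier, and k=2 are illustrative assumptions, not recommendations from the source.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Selection is a pipeline step, so it only ever sees the training folds.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),  # k=2 is an arbitrary choice
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Keeping the selector inside the pipeline avoids scoring features on data that is later used for validation.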

Theoretical Basis

Filter Methods score each feature independently of any model, using a statistical criterion:

Variance Threshold: Remove features whose variance falls below a threshold $\tau$: $\operatorname{Var}(X_j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 < \tau$
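The variance rule can be sketched with scikit-learn's VarianceThreshold. The toy matrix and the threshold value 0.01 are assumptions chosen only so that one constant and one near-constant column get dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, column 1 is nearly constant, column 2 varies.
X = np.array([
    [1.0, 0.0, 3.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 4.0],
    [1.0, 0.0, 2.0],
])

selector = VarianceThreshold(threshold=0.01)  # tau = 0.01 is an illustrative value
X_reduced = selector.fit_transform(X)

# variances_ stores the per-feature variance computed with a 1/n divisor.
print(selector.variances_)
# Boolean mask of the features that were kept.
print(selector.get_support())
```

Because variance depends on the scale of each feature, a single threshold is only meaningful when features are on comparable scales.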

SelectKBest: Select the $k$ features with the highest scores according to a scoring function:

ANOVA F-value (for classification): $F = \frac{\text{between-group variance}}{\text{within-group variance}}$
Chi-squared test: $\chi^{2} = \sum \frac{(O - E)^{2}}{E}$ for non-negative features
Mutual information: $I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
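A short sketch of SelectKBest with two of the scoring functions listed above, using the iris dataset as an assumed example because all of its features are non-negative (a requirement for chi2).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)  # 4 non-negative features, 3 classes

# Keep the k=2 highest-scoring features under each scoring function.
anova_selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

print(anova_selector.scores_)                      # one F-value per feature
print(anova_selector.get_support(indices=True))    # indices kept by ANOVA
print(chi2_selector.get_support(indices=True))     # indices kept by chi-squared

X_top2 = anova_selector.transform(X)
```

The value of k is a hyperparameter; in practice it is often tuned alongside the downstream model rather than fixed in advance.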

Mutual information captures arbitrary (non-linear) dependencies between features and the target, unlike correlation-based measures.
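To illustrate the difference, the sketch below builds a synthetic target that depends quadratically on one feature and not at all on another. The data-generating process is an assumption made for demonstration. Note that scikit-learn estimates mutual information for continuous variables with a nearest-neighbor method rather than the discrete sum shown above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x_signal = rng.uniform(-1.0, 1.0, size=n)  # drives the target non-linearly
x_noise = rng.uniform(-1.0, 1.0, size=n)   # unrelated to the target
y = x_signal ** 2 + rng.normal(scale=0.01, size=n)

X = np.column_stack([x_signal, x_noise])

# Pearson correlation is close to zero because the dependence is symmetric.
pearson_r = np.corrcoef(x_signal, y)[0, 1]

# The mutual information estimate separates the signal from the noise feature.
mi = mutual_info_regression(X, y, random_state=0)

print(pearson_r)
print(mi)
```

For classification targets, mutual_info_classif plays the same role and can be passed to SelectKBest as the score function.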

Wrapper Methods search for optimal feature subsets by evaluating model performance:

Recursive Feature Elimination (RFE):

Train the model on all features.
Rank features by importance (e.g., coefficient magnitude, feature importance).
Remove the least important feature(s).
Repeat until the desired number of features is reached.
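The RFE loop described above is available as sklearn.feature_selection.RFE. The sketch below assumes a synthetic classification problem and a logistic regression estimator, whose coefficient magnitudes serve as the importance ranking; both choices are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

estimator = LogisticRegression(max_iter=1000)

# step=1 removes one feature per iteration until 3 remain.
rfe = RFE(estimator=estimator, n_features_to_select=3, step=1).fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 for selected features; larger values were eliminated earlier

X_selected = rfe.transform(X)
```

When the target number of features is not known in advance, RFECV can choose it by cross-validation.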

Sequential Feature Selector (SFS):

Forward selection: Start with no features; iteratively add the feature that most improves cross-validated performance.
Backward elimination: Start with all features; iteratively remove the feature whose removal least degrades performance.
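Both search directions can be sketched with SequentialFeatureSelector. The estimator (a k-nearest-neighbors classifier), the dataset, and the target of two features are assumptions for illustration; unlike RFE, this selector does not need the estimator to expose coefficients or importances.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start empty and add the feature with the best cross-validated score.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward: start with all features and drop the one that hurts the score least.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print(forward.get_support())
print(backward.get_support())

X_forward = forward.transform(X)
```

The two directions are not guaranteed to agree, and each step refits the model once per candidate feature per fold, which is where the extra computational cost comes from.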

Embedded Methods perform selection during training:

SelectFromModel uses an estimator's learned feature importances or coefficients to select features above a threshold: $\text{selected} = \{ j : |\hat{\beta}_{j}| > \tau \}$

For L1-regularized models, many coefficients are exactly zero, providing natural feature selection. For tree-based models, feature importance is typically measured by the total reduction in impurity contributed by each feature.
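A minimal sketch of both embedded variants follows. The synthetic data, the L1-penalized LinearSVC with C=0.05, and the random forest with a median threshold are illustrative assumptions; other L1-regularized or tree-based estimators can be substituted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

# L1-regularized linear model: features with (near-)zero coefficients are dropped.
l1_model = LinearSVC(penalty="l1", dual=False, C=0.05)
l1_selector = SelectFromModel(l1_model).fit(X, y)
l1_mask = l1_selector.get_support()
coef = l1_selector.estimator_.coef_.ravel()
print(int(l1_mask.sum()), "features kept;", int(np.sum(coef == 0)), "coefficients are zero")

# Tree-based model: keep features whose impurity-based importance reaches the median.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest_selector = SelectFromModel(forest, threshold="median").fit(X, y)
forest_mask = forest_selector.get_support()
print(int(forest_mask.sum()), "features kept by the forest")
```

A smaller C strengthens the L1 penalty and therefore tends to keep fewer features.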
