Principle:DistrictDataLabs Yellowbrick Feature Ranking
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Feature_Analysis, Visualization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Feature ranking is the process of scoring and ordering individual features or feature pairs according to a statistical measure of their quality, relevance, or distributional characteristics.
Description
Feature ranking assigns a numeric score to each feature (in the univariate case) or to each pair of features (in the bivariate case) using a chosen statistical algorithm, then presents those scores visually so that analysts can quickly identify the most informative or problematic dimensions in a dataset.
In the univariate (1D) case, each feature is evaluated independently. A common algorithm is the Shapiro-Wilk test, which measures how closely a feature's distribution resembles a normal distribution. Features with high Shapiro-Wilk scores are approximately Gaussian, which can be important for algorithms that assume normality.
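As a rough illustration of univariate ranking, the sketch below scores each column with `scipy.stats.shapiro` and sorts the features by the resulting statistic. The helper name `rank_features_shapiro`, the feature names, and the synthetic data are assumptions made for this example; they are not taken from any particular library.

```python
import numpy as np
from scipy.stats import shapiro


def rank_features_shapiro(X, names):
    """Return (name, W) pairs sorted from most to least normal-looking."""
    # shapiro() returns (statistic, p-value); index 0 is the W statistic.
    scores = [(name, float(shapiro(X[:, j])[0])) for j, name in enumerate(names)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical data: one roughly Gaussian column and one right-skewed column.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),
    rng.exponential(size=200),
])
ranking = rank_features_shapiro(X, ["gaussian", "skewed"])
for name, w in ranking:
    print(f"{name}: W = {w:.3f}")
```

Under these assumptions the Gaussian column should rank first, since its statistic lies closer to 1.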
In the bivariate (2D) case, every pair of features is compared to produce a symmetric matrix of scores. Common algorithms include Pearson correlation (linear relationship), Spearman rank correlation (monotonic relationship), Kendall tau (ordinal association), and covariance (joint variability). These pairwise scores reveal redundancy between features, potential multicollinearity, and clusters of correlated variables.
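A minimal sketch of the bivariate case, assuming the data is a NumPy array with one column per feature: `numpy.corrcoef` and `numpy.cov` each return a symmetric feature-by-feature matrix. The synthetic columns below are made up for illustration only.

```python
import numpy as np

# Hypothetical data: column 1 is nearly a linear function of column 0,
# column 2 is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
X = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

# rowvar=False tells NumPy that columns (not rows) are the variables.
pearson = np.corrcoef(X, rowvar=False)
covariance = np.cov(X, rowvar=False)

print(np.round(pearson, 2))
```

Because both matrices are symmetric, only the upper or lower triangle needs to be inspected or plotted.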
Usage
Feature ranking is used during exploratory data analysis and feature selection to:
- Identify uninformative features that have low variance or unusual distributions.
- Detect multicollinearity by finding pairs of features with very high correlation.
- Guide feature selection by revealing which features carry independent information.
- Validate preprocessing by confirming that normalization or scaling has produced the expected distributional properties.
It is especially useful when the number of features is moderate (up to a few hundred), since the pairwise comparison of $d$ features produces a $d \times d$ matrix that becomes unwieldy for very high-dimensional data.
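The multicollinearity check from the list above can be sketched as a scan over the upper triangle of the correlation matrix. The function name `highly_correlated_pairs`, the default threshold of 0.9, and the toy data are assumptions chosen for this example, not recommended settings.

```python
import numpy as np


def highly_correlated_pairs(X, names, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    # k=1 skips the diagonal, so each unordered pair is visited exactly once.
    rows, cols = np.triu_indices_from(corr, k=1)
    return [
        (names[i], names[j], float(corr[i, j]))
        for i, j in zip(rows, cols)
        if abs(corr[i, j]) >= threshold
    ]


# Hypothetical data: y is an exact linear function of x, z is unrelated.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([x, 2.0 * x + 1.0, [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]])
pairs = highly_correlated_pairs(X, ["x", "y", "z"])
print(pairs)
```

The number of pairs grows quadratically with the number of features, which is why this kind of scan is most practical at moderate dimensionality.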
Theoretical Basis
Univariate Ranking: Shapiro-Wilk Test
The Shapiro-Wilk test statistic is defined as:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where $x_{(i)}$ are the ordered sample values, $\bar{x}$ is the sample mean, and $a_i$ are constants generated from the expected values of order statistics of a standard normal distribution. A value of $W$ close to 1 indicates normality.
Bivariate Ranking: Pearson Correlation
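The Pearson correlation coefficient between features $X$ and $Y$ is:
$$r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
This yields values in $[-1, 1]$, where $\pm 1$ denotes perfect linear dependence.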
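For concreteness, here is a small pure-Python sketch of the Pearson coefficient: the sum of centered cross-products divided by the square root of the product of the two centered sums of squares. The function name `pearson` and the sample values are illustrative only, and the sketch assumes neither input is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of centered cross-products.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the centered sums of squares.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # exact positive linear relation
decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # exact negative linear relation
```

The two example calls sit at the ends of the range: one pair is perfectly positively related, the other perfectly negatively related.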
Bivariate Ranking: Spearman Rank Correlation
Spearman's $\rho$ applies the Pearson formula to the rank-transformed data, so it captures monotonic relationships even when they are non-linear.
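The rank-then-correlate idea can be sketched in pure Python as follows. The helper names `average_ranks`, `pearson`, and `spearman` are made up for this example, and ties are assumed to receive the average of the ranks they span, which is one common convention.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but non-linear
rho = spearman(x, y)
r = pearson(x, y)
```

On this cubic example the rank correlation reaches 1 while the Pearson coefficient on the raw values stays below 1, which is the distinction the text draws.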
Bivariate Ranking: Kendall Tau
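Kendall's $\tau$ counts the number of concordant and discordant pairs:
$$\tau = \frac{n_c - n_d}{\binom{n}{2}}$$
where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs among the $\binom{n}{2}$ pairs of observations.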
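A minimal sketch of the pair-counting idea, assuming the simplest variant (often called tau-a): the difference between concordant and discordant pair counts, divided by the total number of pairs, with no correction for ties. Library implementations commonly apply a tie-corrected variant, so results can differ when ties are present. The function name `kendall_tau_a` is chosen for this example.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both coordinates move in the same direction
            concordant += 1
        elif product < 0:    # the coordinates move in opposite directions
            discordant += 1
        # product == 0 means a tie in x or y; it counts toward neither total
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


same_order = kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4])
reversed_order = kendall_tau_a([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau_a([1, 2, 3], [1, 3, 2])  # 2 concordant, 1 discordant
```

This direct count is quadratic in the number of observations, which is acceptable for a sketch but slower than the sorting-based algorithms used in practice.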