Workflow:DistrictDataLabs Yellowbrick Feature Analysis and Selection

Knowledge Sources	Yellowbrick; Yellowbrick Docs: Feature Visualizers
Domains	Machine_Learning, Feature_Engineering, Exploratory_Data_Analysis
Last Updated	2026-02-08 12:00 GMT

Overview

End-to-end process for visually exploring, ranking, and selecting features prior to model training using Yellowbrick's feature analysis visualizers.

Description

This workflow covers the visual feature analysis pipeline that typically precedes model fitting. It uses Yellowbrick's feature visualizers to understand feature relationships, detect collinearity, view high-dimensional data in two or three dimensions, and assess class separability. The process moves from single-feature and pairwise ranking through multivariate visualization to dimensionality reduction, building an understanding of which features contribute most to the modeling task.

Key outputs:

Rank1D bar chart of per-feature scores and Rank2D heatmap of pairwise feature correlations
Parallel coordinates plot showing per-instance feature profiles across classes
RadViz plot showing class separation in circular feature space
PCA projection onto the leading principal components, read alongside their explained variance
Manifold embedding (t-SNE, Isomap, etc.) revealing nonlinear structure

Usage

Execute this workflow before model training when you need to understand feature relationships, detect multicollinearity, identify redundant features, or visualize class separability in feature space. The results inform feature selection, surface data quality issues, and build intuition about which algorithms may perform well on the dataset.

Execution Steps

Step 1: Load Dataset

Load the dataset with features and optional target labels. Yellowbrick's feature visualizers accept the same X, y format as scikit-learn transformers.

Key considerations:

Use Yellowbrick's built-in loaders (e.g., load_credit, load_occupancy, load_energy) for experimentation
Feature names can be passed to visualizers for readable axis labels
Both numeric and encoded categorical features are supported
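The X, y format described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's bundled iris data as a stand-in (it is not named in this workflow); Yellowbrick's own loaders are left out here because they fetch their data on first use. The variable names `features` and `classes` are illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: iris stands in for whatever dataset the workflow is run on.
iris = load_iris(as_frame=True)
X = iris.data      # pandas DataFrame, one numeric column per feature
y = iris.target    # pandas Series of integer class labels

# Readable names that can later be handed to visualizers for axis and legend labels.
features = list(X.columns)
classes = list(iris.target_names)

print(X.shape, y.shape)
print(features)
```

Keeping X as a DataFrame means column names travel with the data, which is convenient when a later step selects features by name.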

Step 2: Rank Feature Correlations

Use the Rank1D visualizer to score features one at a time, or the Rank2D visualizer to score them in pairs. Rank2D computes a pairwise matrix (Pearson, Spearman, or covariance) and renders it as a lower-triangle heatmap with color-coded magnitude.

What to look for:

Dark red/blue cells indicate strong positive/negative correlations
Highly correlated feature pairs may introduce multicollinearity
Consider removing one feature from strongly correlated pairs
Rank1D scores each feature on its own, by default with the Shapiro-Wilk test of normality

Step 3: Visualize Feature Profiles

Use ParallelCoordinates or RadViz to visualize instances across all features, colored by target class. ParallelCoordinates draws each instance as a polyline across vertical feature axes. RadViz arranges features equidistantly on a unit circle and places each instance at the average of those feature anchors weighted by its normalized feature values, so an instance is pulled toward the features on which it scores high.

What to look for:

In parallel coordinates, class separation is visible where line bundles diverge at specific features
In RadViz, well-separated class clusters indicate good discriminative features
Overlapping classes suggest the feature set may not fully distinguish the target
Use the normalize parameter in parallel coordinates when features are on different scales
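The RadViz placement rule can be made concrete with a few lines of NumPy. This is a sketch of the standard RadViz mapping, not Yellowbrick's source code; the function name `radviz_project` and the demo array are hypothetical.

```python
import numpy as np

def radviz_project(X):
    """Map each row of X to a 2D point inside the unit circle."""
    X = np.asarray(X, dtype=float)

    # Min-max scale each feature to [0, 1] so every pull is non-negative.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # constant features contribute no pull
    W = (X - X.min(axis=0)) / span

    # One anchor per feature, evenly spaced on the unit circle.
    n_features = X.shape[1]
    angles = 2.0 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each instance sits at the average of the anchors,
    # weighted by its scaled feature values.
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # a row at the minimum of every feature stays at the origin
    return (W @ anchors) / totals

X_demo = np.array([
    [1.0, 0.0, 0.0, 0.0],  # only feature 0 is high: lands on anchor 0
    [0.0, 1.0, 0.0, 0.0],  # only feature 1 is high: lands on anchor 1
    [1.0, 1.0, 1.0, 1.0],  # all features equally high: pulls cancel at the center
])
points = radviz_project(X_demo)
print(points.round(3))
```

The third row shows why points near the center are ambiguous: balanced high values and balanced low values both end up there, so class separation in RadViz is most informative away from the origin.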

Step 4: Project with PCA

Use the PCA visualizer to project the data onto its first two or three principal components. This linear dimensionality reduction shows the maximum-variance projection and displays explained variance ratios.

What to look for:

A cumulative explained variance ratio above roughly 80% in the first two components suggests the data is close to low-dimensional in the linear sense
Clear class clusters in PCA space suggest the classes may be linearly separable
The biplot option (proj_features=True) shows feature contribution arrows in the projection space
3D projections (projection=3) can reveal additional structure
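The explained-variance check can also be computed directly. The sketch below uses scikit-learn's own PCA rather than the Yellowbrick visualizer, with iris as an assumed stand-in dataset and standardized features; the 0.8 cutoff simply restates the rule of thumb above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: iris stands in for the workflow's dataset.
X, y = load_iris(return_X_y=True)

# Standardize first so no single feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
ratios = pca.explained_variance_ratio_  # one ratio per retained component
cumulative = float(ratios.sum())

print(ratios.round(3), round(cumulative, 3))
print("two components look sufficient:", cumulative > 0.8)
```

A low cumulative ratio does not mean the data lacks structure, only that a linear two-dimensional view discards much of the variance.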

Step 5: Explore Nonlinear Structure with Manifold

Use the Manifold visualizer to apply nonlinear dimensionality reduction (t-SNE, Isomap, MDS, Spectral Embedding, or Locally Linear Embedding). This reveals structure that PCA's linear projection may miss.

Key considerations:

t-SNE is the most commonly used manifold method for visualization
Manifold methods are computationally expensive on large datasets
Perplexity and learning rate parameters strongly affect t-SNE results
The manifold visualizer supports multiple sklearn manifold algorithms through its manifold parameter

Step 6: Inspect Specific Feature Pairs

Use the JointPlotVisualizer to examine the relationship between two specific features in detail. This shows a scatter plot with marginal histograms and an optional best-fit line.

What to look for:

Outliers and data entry errors visible as isolated points
Nonlinear relationships between feature pairs
Distribution shape of each feature from the marginal histograms
Correlation strength from the scatter pattern
