Workflow:DistrictDataLabs Yellowbrick Feature Analysis and Selection
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Feature_Engineering, Exploratory_Data_Analysis |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
End-to-end process for visually exploring, ranking, and selecting features prior to model training using Yellowbrick's feature analysis visualizers.
Description
This workflow covers the visual feature analysis pipeline that typically precedes model fitting. It uses Yellowbrick's feature visualizers to understand feature relationships, detect collinearity, visualize high-dimensional data in reduced dimensions, and assess class separability. The process moves from univariate and pairwise feature ranking through multivariate visualization to dimensionality reduction, building an understanding of which features contribute most to the modeling task.
Key outputs:
- Rank1D bar chart of per-feature scores and Rank2D heatmap of pairwise feature correlations
- Parallel coordinates plot showing per-instance feature profiles across classes
- RadViz plot showing class separation in circular feature space
- PCA projection of the data onto its leading principal components
- Manifold embedding (t-SNE, Isomap, etc.) revealing nonlinear structure
Usage
Execute this workflow before model training when you need to understand feature relationships, detect multicollinearity, identify redundant features, or visualize class separability in feature space. It supports feature selection, helps surface data quality issues, and builds intuition about which algorithms may perform well on the dataset.
Execution Steps
Step 1: Load Dataset
Load the dataset with features and optional target labels. Yellowbrick's feature visualizers accept the same X, y format as scikit-learn transformers.
Key considerations:
- Use Yellowbrick's built-in loaders (e.g., load_credit, load_occupancy, load_energy) for experimentation
- Feature names can be passed to visualizers for readable axis labels
- Visualizers expect numeric input, so categorical features must be encoded (for example, one-hot or ordinal) before use
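A minimal sketch of the X, y layout the visualizers expect. As an assumption made only for illustration, it uses scikit-learn's bundled wine dataset as a stand-in because it needs no download; the Yellowbrick loaders named above can be substituted where they are available.

```python
from sklearn.datasets import load_wine

# Stand-in dataset (an assumption, not part of the workflow itself):
# scikit-learn's bundled wine data.
data = load_wine()

X = data.data                                   # feature matrix, shape (n_samples, n_features)
y = data.target                                 # integer class labels, shape (n_samples,)
features = list(data.feature_names)             # names for readable axis labels
classes = [str(c) for c in data.target_names]   # names for plot legends

print(X.shape, y.shape)
print(features[:3], classes)
```

The `features` and `classes` lists are what later steps pass to the visualizers so that axes and legends show names instead of column indices.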
Step 2: Rank Feature Correlations
Use the Rank1D visualizer to score features one at a time and the Rank2D visualizer to score them in pairs. Rank2D computes a pairwise matrix (Pearson correlation, Spearman correlation, or covariance) and renders it as a lower-triangle heatmap with color-coded magnitude.
What to look for:
- Dark red/blue cells indicate strong positive/negative correlations
- Highly correlated feature pairs may introduce multicollinearity
- Consider removing one feature from strongly correlated pairs
- Rank1D shows a score per feature as a bar chart; its default algorithm is the Shapiro-Wilk test, which measures how close each feature's distribution is to normal
Step 3: Visualize Feature Profiles
Use ParallelCoordinates or RadViz to visualize instances across all features, colored by target class. ParallelCoordinates draws each instance as a polyline across vertical feature axes. RadViz arranges the features as equally spaced anchors on a unit circle and places each instance inside the circle, pulled toward the anchors of the features on which it has relatively high normalized values.
What to look for:
- In parallel coordinates, class separation is visible where line bundles diverge at specific features
- In RadViz, well-separated class clusters indicate good discriminative features
- Overlapping classes suggest the feature set may not fully distinguish the target
- Use the normalize parameter of ParallelCoordinates (for example, normalize="standard" or normalize="minmax") when features are on different scales
Step 4: Project with PCA
Use the PCA visualizer to project the data onto its 2 or 3 largest principal components. This linear dimensionality reduction shows the maximum-variance projection and displays explained variance ratios.
What to look for:
- A high cumulative explained variance ratio (for example, above 80%) in the first two components suggests the data is effectively low-dimensional
- Clear class clusters in PCA space indicate linear separability
- The biplot option (proj_features=True) draws an arrow for each original feature, showing its direction and relative contribution in the projection space
- 3D projections (projection=3) can reveal additional structure
Step 5: Explore Nonlinear Structure with Manifold
Use the Manifold visualizer to apply nonlinear dimensionality reduction (t-SNE, Isomap, MDS, Spectral Embedding, or Locally Linear Embedding). This reveals structure that PCA's linear projection may miss.
Key considerations:
- t-SNE is a common choice among manifold methods for visualization
- Manifold methods are computationally expensive on large datasets
- Perplexity and learning rate parameters strongly affect t-SNE results
- The Manifold visualizer selects among several scikit-learn manifold algorithms through its manifold parameter (for example, manifold="tsne" or manifold="isomap")
Step 6: Inspect Specific Feature Pairs
Use the JointPlotVisualizer to examine the relationship between two specific features in detail. This shows a scatter plot with marginal histograms and an optional best-fit line.
What to look for:
- Outliers and data entry errors visible as isolated points
- Nonlinear relationships between feature pairs
- Distribution shape of each feature from the marginal histograms
- Correlation strength from the scatter pattern