Heuristic:DistrictDataLabs Yellowbrick NaN Data Handling
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Visualization |
| Last Updated | 2026-02-08 05:00 GMT |
Overview
Warn-then-filter pattern for handling NaN values: warn the user about the count and percentage of rows with missing data, then drop those incomplete rows before plotting.
Description
Yellowbrick implements a consistent two-step pattern for dealing with NaN (Not a Number) values in input data. First, it warns the user by reporting the count and percentage of rows containing NaN values. Then, it filters those rows out so that only complete data is plotted. This approach makes the user aware of data quality issues while still producing a usable visualization. The pattern uses a custom `DataWarning` class (a subclass of `YellowbrickWarning`, which in turn derives from `UserWarning`) so warnings can be programmatically caught or suppressed.
Usage
Apply this heuristic whenever a visualizer receives raw user data that may contain missing values. This is especially important for feature visualizers (RadViz, Parallel Coordinates, Rank2D) where NaN values would break matplotlib rendering or produce misleading plots. The pattern is implemented in `yellowbrick.utils.nan_warnings` and should be called in the `draw()` method before any plotting code.
The Insight (Rule of Thumb)
- Action: Call `warn_if_nans_exist(X)` followed by `X, y = filter_missing(X, y)` at the start of every `draw()` method that accepts raw feature data.
- Value: Warning includes the count, the total, and the percentage formatted to two decimal places: `"Warning! Found {count} rows of {total} ({percent:0.2f}%) with nan values. Only complete rows will be plotted."`
- Trade-off: Incomplete data points do not appear in the visualization, but the plot renders correctly. The warning keeps the filtering from going unnoticed.
- Pattern: Use `DataWarning` (not generic `UserWarning`) so downstream code can selectively catch or suppress Yellowbrick-specific data warnings.
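The rule above can be illustrated on a toy array. This is a minimal sketch, not Yellowbrick's code: `complete_rows` is a hypothetical helper, and the local `DataWarning` class is a stand-in for the one in `yellowbrick.exceptions`, so the snippet runs with only NumPy installed.

```python
import warnings

import numpy as np


class DataWarning(UserWarning):
    """Hypothetical stand-in for yellowbrick.exceptions.DataWarning."""


def complete_rows(X):
    """Warn about rows containing NaN, then return only the complete rows.

    Hypothetical helper that combines the warn step and the filter step.
    Assumes X is a 2-D array of floats.
    """
    X = np.asarray(X, dtype=float)
    incomplete = np.isnan(X).any(axis=1)  # True for rows with at least one NaN
    count, total = int(incomplete.sum()), len(X)
    if count > 0:
        percent = 100 * count / total
        warnings.warn(
            "Warning! Found {} rows of {} ({:0.2f}%) with nan values. Only "
            "complete rows will be plotted.".format(count, total, percent),
            DataWarning,
        )
    return X[~incomplete]


X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])

# Record the warning instead of printing it so the message can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_plot = complete_rows(X)

message = str(caught[0].message)
print(message)
print(X_plot.shape)  # one of the four rows was dropped
```

With one incomplete row out of four, the message reports 25.00% and three rows remain for plotting.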
Reasoning
In many plot types, NaN values either raise errors in matplotlib or leave gaps in the rendered output. Rather than crashing or producing a broken visualization, Yellowbrick takes a defensive approach: warn and filter. This follows the Python philosophy of being explicit (the warning) while still being practical (filtering allows the visualization to complete). The custom `DataWarning` hierarchy allows users to suppress these warnings in production pipelines with `warnings.filterwarnings("ignore", category=DataWarning)` without silencing unrelated warnings.
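The selective suppression described above can be sketched with the standard `warnings` module alone. The two classes below are stand-ins that mirror the hierarchy named in this page (an assumption made so the snippet runs without Yellowbrick installed); `emit_both` and `categories_seen` are hypothetical helpers.

```python
import warnings


class YellowbrickWarning(UserWarning):
    """Stand-in for the library's base warning class."""


class DataWarning(YellowbrickWarning):
    """Stand-in for the library's data-quality warning class."""


def emit_both():
    """Emit one data warning and one unrelated warning."""
    warnings.warn("Found 1 rows of 4 (25.00%) with nan values.", DataWarning)
    warnings.warn("unrelated notice", DeprecationWarning)


def categories_seen(ignore_data_warnings):
    """Return the categories of the warnings that get past the filters."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if ignore_data_warnings:
            # Inserted ahead of the "always" rule, so it takes precedence
            # for DataWarning and leaves every other category untouched.
            warnings.filterwarnings("ignore", category=DataWarning)
        emit_both()
    return [w.category for w in caught]


print(categories_seen(ignore_data_warnings=False))
print(categories_seen(ignore_data_warnings=True))
```

In the second call only the `DeprecationWarning` is recorded, which is the behaviour a generic `UserWarning` filter could not provide without also hiding other user warnings.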
The filter operates on rows (not individual elements) because most visualizations require complete feature vectors. If X and y are both provided, filtering is synchronized so that removing a row from X also removes the corresponding label from y, preventing shape mismatches.
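A small worked example may make the row-level, synchronized behaviour concrete. It is a sketch using plain NumPy on made-up data and assumes a 2-D float `X` and a numeric `y`; the variable names are illustrative only.

```python
import numpy as np

# Row 1 has a NaN feature; row 2 has a NaN label.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0.0, 1.0, np.nan, 1.0])

x_nans = np.isnan(X).any(axis=1)       # one flag per row, not per element
y_nans = np.isnan(y)                   # one flag per label
drop = np.logical_or(x_nans, y_nans)   # drop a row if either side is incomplete

X_clean, y_clean = X[~drop], y[~drop]

print(X_clean)        # rows 0 and 3 of X
print(y_clean)        # labels 0 and 3 of y
print(X_clean.shape, y_clean.shape)
```

Because the same boolean mask indexes both arrays, `X_clean` and `y_clean` always end up with the same number of rows.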
Code Evidence
Warn-then-filter pattern from `yellowbrick/features/radviz.py:173-174`:
nan_warnings.warn_if_nans_exist(X)
X, y = nan_warnings.filter_missing(X, y)
Warning implementation from `yellowbrick/utils/nan_warnings.py:67-78`:
def warn_if_nans_exist(X):
    """Warn if nans exist in a numpy array."""
    null_count = count_rows_with_nans(X)
    total = len(X)
    percent = 100 * null_count / total
    if null_count > 0:
        warning_message = (
            "Warning! Found {} rows of {} ({:0.2f}%) with nan values. Only "
            "complete rows will be plotted.".format(null_count, total, percent)
        )
        warnings.warn(warning_message, DataWarning)
Synchronized X/y filtering from `yellowbrick/utils/nan_warnings.py:58-64`:
def filter_missing_X_and_y(X, y):
    """Remove rows from X and y where either contains nans."""
    y_nans = np.isnan(y)
    x_nans = np.isnan(X).any(axis=1)
    unioned_nans = np.logical_or(x_nans, y_nans)
    return X[~unioned_nans], y[~unioned_nans]
Non-finite value clamping from `yellowbrick/utils/helpers.py:247`:
result[~np.isfinite(result)] = 0 # -inf inf NaN
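Unlike the row filter, this line replaces non-finite entries with 0 rather than dropping them. The sketch below shows the effect on made-up data. It assumes `result` comes from an element-wise division, which is a guess at the context, since only the single line is quoted here.

```python
import numpy as np

numerator = np.array([1.0, -1.0, 0.0, 6.0])
denominator = np.array([0.0, 0.0, 0.0, 3.0])

# Silence NumPy's divide-by-zero and invalid-value runtime warnings
# for this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    result = np.true_divide(numerator, denominator)  # inf, -inf, nan, 2.0

# Replace every entry that is inf, -inf, or NaN with 0.
result[~np.isfinite(result)] = 0

print(result)
```

The array keeps its original length, so this approach suits intermediate computations where positions must be preserved, whereas row filtering suits data that is about to be plotted.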