Environment:DistrictDataLabs Yellowbrick Optional NLP Dependencies
| Knowledge Sources | |
|---|---|
| Domains | NLP, Visualization, Text_Analysis |
| Last Updated | 2026-02-08 05:00 GMT |
Overview
Optional NLP dependency environment adding NLTK, UMAP, and pandas for Yellowbrick text visualizers and DataFrame support.
Description
This environment extends the core Python/scikit-learn environment with optional packages needed for text corpus visualization and enhanced data handling. NLTK is required for part-of-speech tagging and word frequency visualizers. umap-learn (with its numba dependency) provides UMAP dimensionality reduction for document embedding visualization. pandas enables DataFrame-based data loading and is required by several dataset loaders. These packages are imported conditionally via try/except blocks, so the core library functions without them.
Usage
Use this environment when working with text corpus analysis (t-SNE, UMAP, POS tagging, word frequency, dispersion plots) or when using pandas DataFrames with Yellowbrick dataset loaders. The UMAP visualizer will raise an error if umap-learn is not installed; other text visualizers require NLTK data to be downloaded.
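Before choosing a visualizer, it can help to check which of the optional packages are actually present. The following is a minimal sketch, not part of Yellowbrick: the helper name `optional_dependency_status` and the `OPTIONAL_MODULES` mapping are hypothetical. It only locates the modules without importing them, so it will not surface errors that occur at import time.

```python
from importlib.util import find_spec

# Hypothetical mapping of pip package name -> import name.
# Note that umap-learn is imported as "umap".
OPTIONAL_MODULES = {"nltk": "nltk", "pandas": "pandas", "umap-learn": "umap"}


def optional_dependency_status(modules=OPTIONAL_MODULES):
    """Return {pip_name: bool}, True when the module can be located."""
    return {
        pip_name: find_spec(import_name) is not None
        for pip_name, import_name in modules.items()
    }


if __name__ == "__main__":
    for name, present in optional_dependency_status().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

A missing entry here corresponds to the `pip install` commands listed under Quick Install.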
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Cross-platform (Linux, macOS, Windows) | Same as core environment |
| Python | >= 3.5, < 4 | The UMAP import guard targets legacy Python 2.7 on 32-bit Windows, which is outside this range |
| Hardware | Standard CPU | UMAP/numba can benefit from multi-core CPUs |
| Disk | ~200 MB additional | NLTK `popular` data downloads + numba cache |
Dependencies
System Packages
No additional OS-level packages are required.
Python Packages
- `nltk` >= 3.2 (required for PosTagVisualizer, FreqDistVisualizer)
- `pandas` >= 1.0.4 (required for DataFrame loading in dataset loaders)
- `umap-learn` >= 0.5 (required for UMAPVisualizer)
- `numba` >= 0.55 (transitive dependency of umap-learn)
NLTK Data Downloads:
- The `popular` NLTK dataset bundle must be downloaded for text visualizers to function.
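To confirm that NLTK data is in place before running a text visualizer, a check along the following lines may be useful. This is a sketch under assumptions: the helper name `nltk_resource_available` is hypothetical, and the resource path `tokenizers/punkt` is only an example, since resource names differ between NLTK versions. It relies on `nltk.data.find`, which raises `LookupError` when a resource is absent.

```python
def nltk_resource_available(resource_path):
    """Return True if NLTK is importable and the resource is on its data path."""
    try:
        import nltk
    except ImportError:
        # NLTK itself is missing, so no resource can be available.
        return False
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # This is the same error the text visualizers surface
        # when the data has not been downloaded.
        return False


if __name__ == "__main__":
    # Example path only; adjust to the resource your NLTK version expects.
    if not nltk_resource_available("tokenizers/punkt"):
        print("Run: python -m nltk.downloader popular")
```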
Credentials
No credentials required.
Quick Install
# Install optional NLP dependencies
pip install nltk pandas umap-learn
# Download NLTK data (required for text visualizers)
python -m nltk.downloader popular
Code Evidence
UMAP conditional import with multi-exception handling from `yellowbrick/text/umap_vis.py:32-40`:
try:
    from umap import UMAP
except ImportError:
    UMAP = None
except (RuntimeError, AttributeError):
    UMAP = None
    warnings.warn(
        "Error Importing UMAP. UMAP does not support python 2.7 on Windows 32 bit."
    )
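Code that depends on this import then has to check for `None` before using UMAP. The sketch below shows one way such a guard could look. It is illustrative only: `require_umap` is a hypothetical helper, and it raises a plain `ImportError` to stay self-contained, whereas the Common Errors table indicates that Yellowbrick reports a `YellowbrickValueError`.

```python
import warnings

try:
    from umap import UMAP
except ImportError:
    UMAP = None
except (RuntimeError, AttributeError):
    UMAP = None
    warnings.warn("Error importing UMAP on this platform.")


def require_umap():
    """Return the UMAP class, or fail with an actionable message.

    Hypothetical helper; not Yellowbrick's actual error handling.
    """
    if UMAP is None:
        raise ImportError(
            "umap-learn is not available; install it with 'pip install umap-learn'"
        )
    return UMAP
```

Deferring the failure to the point of use keeps the rest of the module importable when umap-learn is absent.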
Pandas conditional import from `yellowbrick/datasets/base.py:32-35`:
try:
    import pandas as pd
except ImportError:
    pd = None
Duck-typed pandas detection from `yellowbrick/utils/types.py:176-183`:
def is_dataframe(obj):
    try:
        from pandas import DataFrame
        return isinstance(obj, DataFrame)
    except ImportError:
        # Pandas is not a dependency, so this is scary
        return obj.__class__.__name__ == "DataFrame"
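The fallback branch is worth seeing in isolation, because it explains the "scary" comment: a class-name comparison accepts any object whose class happens to be called `DataFrame`. The sketch below reproduces only that branch. Both `name_based_is_dataframe` and the stand-in `DataFrame` class are hypothetical and exist purely for illustration.

```python
def name_based_is_dataframe(obj):
    """Mimic only the no-pandas fallback: compare the class name."""
    return obj.__class__.__name__ == "DataFrame"


class DataFrame:
    """Hypothetical stand-in that is unrelated to pandas."""


if __name__ == "__main__":
    # The unrelated stand-in passes the name check.
    print(name_based_is_dataframe(DataFrame()))
    # Ordinary containers do not.
    print(name_based_is_dataframe([1, 2, 3]))
```

When pandas is installed, the `isinstance` branch shown in the excerpt is used instead, and a stand-in like this one would be rejected.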
Optional dependency declarations from `requirements.txt:8-11`:
## Optional Dependencies (uncomment to use)
# nltk>=3.2
# pandas>=1.0.4
# umap-learn>=0.5
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `UMAP is None` / `YellowbrickValueError` when using UMAPVisualizer | umap-learn not installed | `pip install umap-learn` |
| `RuntimeError` or `AttributeError` on UMAP import | Unsupported platform (legacy 32-bit Windows) | Use 64-bit Python on a supported OS |
| `LookupError: ... NLTK resource not found` | NLTK data not downloaded | `python -m nltk.downloader popular` |
| `TypeError` with pandas operations | pandas not installed but DataFrame passed | `pip install "pandas>=1.0.4"` (quoted so the shell does not treat `>` as a redirect) |
Compatibility Notes
- UMAP on Windows: Legacy 32-bit Windows with Python 2.7 is not supported. The code catches RuntimeError and AttributeError specifically for this case.
- Pandas is optional: The codebase uses duck typing (`obj.__class__.__name__ == "DataFrame"`) as a fallback when pandas is not installed. This is noted in the code as "scary" but functional.
- NLTK data: Text visualizers require the NLTK `popular` download bundle. The CI pipeline runs `python -m nltk.downloader popular` as part of setup.
- numba compilation: The first use of UMAP triggers numba JIT compilation, which can be slow. Subsequent calls use cached compiled functions.