Environment:DistrictDataLabs Yellowbrick Optional NLP Dependencies

Knowledge Sources	Yellowbrick requirements.txt text/umap_vis.py
Domains	NLP, Visualization, Text_Analysis
Last Updated	2026-02-08 05:00 GMT

Overview

Optional NLP dependency environment adding NLTK, UMAP, and pandas for Yellowbrick text visualizers and DataFrame support.

Description

This environment extends the core Python/scikit-learn environment with optional packages needed for text corpus visualization and enhanced data handling. NLTK is required for part-of-speech tagging and word frequency visualizers. umap-learn (with its numba dependency) provides UMAP dimensionality reduction for document embedding visualization. pandas enables DataFrame-based data loading and is required by several dataset loaders. These packages are imported conditionally via try/except blocks, so the core library functions without them.

Usage

Use this environment when working with text corpus analysis (t-SNE, UMAP, POS tagging, word frequency, dispersion plots) or when passing pandas DataFrames to Yellowbrick dataset loaders. UMAPVisualizer raises a `YellowbrickValueError` if umap-learn is not installed. The NLTK-based visualizers (PosTagVisualizer, FreqDistVisualizer) need both the `nltk` package and its downloaded data.

System Requirements

Category	Requirement	Notes
OS	Cross-platform (Linux, macOS, Windows)	Same as core environment
Python	>= 3.5, < 4	The UMAP import guard still handles Python 2.7 on 32-bit Windows, a legacy case outside this range
Hardware	Standard CPU	UMAP/numba can benefit from multi-core CPUs
Disk	~200MB additional	NLTK popular data downloads + numba cache

Dependencies

System Packages

No additional OS-level packages are required.

Python Packages

`nltk` >= 3.2 (required for PosTagVisualizer, FreqDistVisualizer)
`pandas` >= 1.0.4 (required for DataFrame loading in dataset loaders)
`umap-learn` >= 0.5 (required for UMAPVisualizer)
`numba` >= 0.55 (transitive dependency of umap-learn)

NLTK Data Downloads:

The `popular` NLTK dataset bundle must be downloaded for text visualizers to function.

Credentials

No credentials required.

Quick Install

# Install optional NLP dependencies
pip install nltk pandas umap-learn

# Download NLTK data (required for text visualizers)
python -m nltk.downloader popular
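
After installing, it can help to confirm which optional packages are visible to the interpreter. The following is a minimal sketch, not part of Yellowbrick: `OPTIONAL_PACKAGES` and `check_optional_packages` are hypothetical names introduced here. It relies only on the standard library and does not import the packages themselves.

```python
from importlib.util import find_spec

# Maps the pip distribution name to the module name that is imported.
# Note that umap-learn is imported as `umap`.
OPTIONAL_PACKAGES = {
    "nltk": "nltk",
    "pandas": "pandas",
    "umap-learn": "umap",
}


def check_optional_packages(packages=OPTIONAL_PACKAGES):
    """Return {distribution name: bool} without importing the packages."""
    return {
        dist: find_spec(module) is not None
        for dist, module in packages.items()
    }


if __name__ == "__main__":
    for dist, available in check_optional_packages().items():
        print(f"{dist}: {'installed' if available else 'missing'}")
```

This only reports whether a module can be located. It does not verify versions or whether the NLTK data has been downloaded.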

Code Evidence

UMAP conditional import with multi-exception handling from `yellowbrick/text/umap_vis.py:32-40`:

try:
    from umap import UMAP
except ImportError:
    UMAP = None
except (RuntimeError, AttributeError):
    UMAP = None
    warnings.warn(
        "Error Importing UMAP.  UMAP does not support python 2.7 on Windows 32 bit."
    )
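
The import guard above leaves `UMAP` set to `None` when the package is unavailable, so calling code has to check for that before use. A minimal sketch of such a check follows. `make_umap` is a hypothetical helper, and `ImportError` is used here as a stand-in; per the Common Errors table, Yellowbrick itself raises `YellowbrickValueError`.

```python
import warnings

try:
    from umap import UMAP
except ImportError:
    UMAP = None
except (RuntimeError, AttributeError):
    UMAP = None
    warnings.warn("umap-learn is installed but could not be imported here")


def make_umap(**kwargs):
    """Hypothetical helper: build a UMAP reducer or fail with an install hint."""
    if UMAP is None:
        # Stand-in exception type; Yellowbrick uses its own error class.
        raise ImportError(
            "umap-learn is required for this feature; "
            "install it with 'pip install umap-learn'"
        )
    return UMAP(**kwargs)
```

Deferring the failure to the point of use is what lets the rest of the library import cleanly without umap-learn.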

Pandas conditional import from `yellowbrick/datasets/base.py:32-35`:

try:
    import pandas as pd
except ImportError:
    pd = None

Duck-typed pandas detection from `yellowbrick/utils/types.py:176-183`:

def is_dataframe(obj):
    try:
        from pandas import DataFrame
        return isinstance(obj, DataFrame)
    except ImportError:
        # Pandas is not a dependency, so this is scary
        return obj.__class__.__name__ == "DataFrame"

Optional dependency declarations from `requirements.txt:8-11`:

## Optional Dependencies (uncomment to use)
# nltk>=3.2
# pandas>=1.0.4
# umap-learn>=0.5

Common Errors

Error Message	Cause	Solution
`UMAP is None` / `YellowbrickValueError` when using UMAPVisualizer	umap-learn not installed	`pip install umap-learn`
`RuntimeError` or `AttributeError` on UMAP import	Unsupported platform (legacy 32-bit Windows)	Use 64-bit Python on a supported OS
`LookupError: ... NLTK resource not found`	NLTK data not downloaded	`python -m nltk.downloader popular`
`TypeError` with pandas operations	pandas not installed but DataFrame passed	`pip install "pandas>=1.0.4"` (quoted so the shell does not treat `>` as a redirect)
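
The `LookupError` case in the table can be detected before a visualizer runs. The following is a minimal sketch, assuming `nltk.data.find` is used to probe for a resource. `nltk_resource_available` is a hypothetical helper, and the resource path in the usage note is an example assumed to belong to the `popular` bundle.

```python
def nltk_resource_available(resource_path):
    """Return True if NLTK can locate the resource, False otherwise."""
    try:
        import nltk.data
    except ImportError:
        return False  # NLTK itself is not installed
    try:
        nltk.data.find(resource_path)
    except LookupError:
        return False  # resource has not been downloaded
    return True
```

For example, `nltk_resource_available("corpora/stopwords")` returning `False` suggests that `python -m nltk.downloader popular` still needs to be run.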

Compatibility Notes

UMAP on Windows: Legacy 32-bit Windows with Python 2.7 is not supported. The code catches RuntimeError and AttributeError specifically for this case.
Pandas is optional: The codebase uses duck typing (`obj.__class__.__name__ == "DataFrame"`) as a fallback when pandas is not installed. This is noted in the code as "scary" but functional.
NLTK data: Text visualizers require the NLTK `popular` download bundle. The CI pipeline runs `python -m nltk.downloader popular` as part of setup.
numba compilation: First use of UMAP triggers numba JIT compilation which can be slow. Subsequent calls use cached compiled functions.

Related Pages

Implementation:DistrictDataLabs_Yellowbrick_Dataset_Loaders
