Environment:DistrictDataLabs Yellowbrick Optional NLP Dependencies


Knowledge Sources
Domains: NLP, Visualization, Text_Analysis
Last Updated: 2026-02-08 05:00 GMT

Overview

Optional NLP dependency environment adding NLTK, UMAP, and pandas for Yellowbrick text visualizers and DataFrame support.

Description

This environment extends the core Python/scikit-learn environment with optional packages needed for text corpus visualization and enhanced data handling. NLTK is required for part-of-speech tagging and word frequency visualizers. umap-learn (with its numba dependency) provides UMAP dimensionality reduction for document embedding visualization. pandas enables DataFrame-based data loading and is required by several dataset loaders. These packages are imported conditionally via try/except blocks, so the core library functions without them.

Usage

Use this environment when working with text corpus analysis (t-SNE, UMAP, POS tagging, word frequency, dispersion plots) or when using pandas DataFrames with Yellowbrick dataset loaders. The UMAP visualizer raises an error if umap-learn is not installed. The NLTK-based visualizers (PosTagVisualizer, FreqDistVisualizer) need both the `nltk` package and the downloaded NLTK data.
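
A quick preflight check can show which of these optional packages are importable before a visualizer is constructed. The sketch below uses only the standard library. The helper name `optional_dependency_status` is hypothetical and not part of Yellowbrick. It only locates the modules without importing them, so it would not surface the import-time `RuntimeError`/`AttributeError` case described under Code Evidence.

```python
from importlib.util import find_spec

# Import names of the optional packages listed on this page.
# Note: the umap-learn distribution is imported as "umap".
OPTIONAL_MODULES = ("nltk", "pandas", "umap")


def optional_dependency_status(modules=OPTIONAL_MODULES):
    """Map each top-level module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in optional_dependency_status().items():
        print(f"{name}: {'available' if available else 'missing'}")
```

Running the script prints one line per package, which makes it easy to see what still needs to be installed.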

System Requirements

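| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Cross-platform (Linux, macOS, Windows) | Same as core environment |
| Python | >= 3.5, < 4 | The UMAP import guard warns about one legacy case outside this range: Python 2.7 on 32-bit Windows. The optional packages at the versions listed below (for example pandas >= 1.0.4) need a newer Python than 3.5, so check each package's supported versions. |
| Hardware | Standard CPU | UMAP/numba can benefit from multi-core CPUs |
| Disk | ~200MB additional | NLTK `popular` data downloads + numba cache |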

Dependencies

System Packages

No additional OS-level packages are required.

Python Packages

  • `nltk` >= 3.2 (required for PosTagVisualizer, FreqDistVisualizer)
  • `pandas` >= 1.0.4 (required for DataFrame loading in dataset loaders)
  • `umap-learn` >= 0.5 (required for UMAPVisualizer)
  • `numba` >= 0.55 (transitive dependency of umap-learn)

NLTK Data Downloads:

  • The `popular` NLTK dataset bundle must be downloaded for text visualizers to function.

Credentials

No credentials required.

Quick Install

# Install optional NLP dependencies
pip install nltk pandas umap-learn

# Download NLTK data (required for text visualizers)
python -m nltk.downloader popular
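
To confirm that the download succeeded, an NLTK resource can be looked up with `nltk.data.find`, which raises `LookupError` when the resource is absent. This is a minimal sketch: the helper name `nltk_resource_status` is hypothetical, and the default resource path `tokenizers/punkt` is an illustrative assumption, since this page does not state which individual resources each visualizer needs.

```python
def nltk_resource_status(resource="tokenizers/punkt"):
    """Report whether NLTK and one of its data resources are available.

    Returns "nltk-missing", "resource-missing", or "available".
    The default resource path is an assumption chosen for illustration.
    """
    try:
        import nltk.data
    except ImportError:
        return "nltk-missing"

    try:
        # nltk.data.find raises LookupError if the resource is not on disk.
        nltk.data.find(resource)
    except LookupError:
        return "resource-missing"
    return "available"


if __name__ == "__main__":
    print(nltk_resource_status())
```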

Code Evidence

UMAP conditional import with multi-exception handling from `yellowbrick/text/umap_vis.py:32-40`:

try:
    from umap import UMAP
except ImportError:
    UMAP = None
except (RuntimeError, AttributeError):
    UMAP = None
    warnings.warn(
        "Error Importing UMAP.  UMAP does not support python 2.7 on Windows 32 bit."
    )
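
The `UMAP = None` sentinel defers the failure from import time to first use. The sketch below shows one way such a sentinel can be turned into a descriptive error. The helper `require_umap` is hypothetical, and raising `ImportError` is a choice made for this sketch only. According to the Common Errors table on this page, Yellowbrick itself surfaces a `YellowbrickValueError`. Unlike the original, the sketch folds all three exception types into one clause and omits the warning.

```python
try:
    from umap import UMAP
except (ImportError, RuntimeError, AttributeError):
    # Sentinel: the optional dependency is unavailable or failed to load.
    UMAP = None


def require_umap():
    """Return the UMAP class, or raise a descriptive error if unavailable."""
    if UMAP is None:
        raise ImportError(
            "umap-learn is required for this feature; "
            "install it with: pip install umap-learn"
        )
    return UMAP
```

Calling `require_umap()` at the point of use keeps the module importable for users who never touch UMAP, while still giving an actionable message to those who do.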

Pandas conditional import from `yellowbrick/datasets/base.py:32-35`:

try:
    import pandas as pd
except ImportError:
    pd = None

Duck-typed pandas detection from `yellowbrick/utils/types.py:176-183`:

def is_dataframe(obj):
    try:
        from pandas import DataFrame
        return isinstance(obj, DataFrame)
    except ImportError:
        # Pandas is not a dependency, so this is scary
        return obj.__class__.__name__ == "DataFrame"
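
The fallback branch compares only the class name, so any object whose class happens to be called `DataFrame` passes the check. The sketch below isolates that branch to show the behavior. Both `is_dataframe_by_name` and the stand-in `DataFrame` class are hypothetical and defined here purely for illustration.

```python
def is_dataframe_by_name(obj):
    """Replicate only the fallback branch: compare the class name."""
    return obj.__class__.__name__ == "DataFrame"


class DataFrame:
    """Hypothetical stand-in class; unrelated to pandas."""


print(is_dataframe_by_name(DataFrame()))  # True: the name matches, though it is not pandas
print(is_dataframe_by_name([1, 2, 3]))    # False: a list's class name is "list"
```

When pandas cannot be imported, a genuine pandas DataFrame cannot exist in the process either, so in practice the fallback matters mainly for DataFrame-like classes from other libraries.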

Optional dependency declarations from `requirements.txt:8-11`:

## Optional Dependencies (uncomment to use)
# nltk>=3.2
# pandas>=1.0.4
# umap-learn>=0.5

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `UMAP is None` / `YellowbrickValueError` when using UMAPVisualizer | umap-learn not installed | `pip install umap-learn` |
| `RuntimeError` or `AttributeError` on UMAP import | Unsupported platform (legacy 32-bit Windows) | Use 64-bit Python on a supported OS |
| `LookupError: ... NLTK resource not found` | NLTK data not downloaded | `python -m nltk.downloader popular` |
| `TypeError` with pandas operations | pandas not installed but DataFrame passed | `pip install "pandas>=1.0.4"` (quote the requirement so the shell does not treat `>` as a redirect) |

Compatibility Notes

  • UMAP on Windows: Legacy 32-bit Windows with Python 2.7 is not supported. The code catches RuntimeError and AttributeError specifically for this case.
  • Pandas is optional: When pandas cannot be imported, the codebase falls back to a class-name check (`obj.__class__.__name__ == "DataFrame"`) instead of `isinstance`. A code comment calls this "scary", because any class named `DataFrame` passes the check, but it works.
  • NLTK data: Text visualizers require the NLTK `popular` download bundle. The CI pipeline runs `python -m nltk.downloader popular` as part of setup.
  • numba compilation: The first use of UMAP triggers numba JIT compilation, which can be slow. Subsequent calls reuse the cached compiled functions.
