Implementation:DistrictDataLabs Yellowbrick Dataset Loaders

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Visualization, Data_Science
Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete tool for loading bundled sample datasets, provided by the Yellowbrick library.

Description

The yellowbrick.datasets.loaders module provides a collection of named loader functions that return bundled datasets suitable for machine learning visualization tasks. The non-corpus public loaders (e.g., load_mushroom, load_bikeshare, load_nfl, load_credit, load_occupancy) share an identical signature, and each delegates to the shared internal helper _load_dataset().

The _load_dataset() helper performs three steps: (1) looks up the dataset's metadata (URL, checksum, feature names) from a JSON manifest file loaded at module import time; (2) constructs a Dataset object that handles downloading from a remote host if the data is not already cached locally, and verifies the SHA-256 signature; and (3) either returns the raw Dataset object or calls data.to_data() to extract the standard (X, y) tuple.
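The three steps above can be sketched without Yellowbrick installed. This is a minimal illustration rather than the library's code: SketchDataset, MANIFEST, and load_sketch are hypothetical stand-ins for Dataset, the JSON manifest, and _load_dataset(), and the download is replaced by an in-memory byte string so the example runs offline.

```python
import csv
import hashlib
import io

# Hypothetical stand-in for the remote file; a real loader would download it.
_REMOTE_BYTES = b"a,b,target\n1,2,0\n3,4,1\n"

# Hypothetical stand-in for the JSON manifest: one metadata entry per dataset.
MANIFEST = {
    "toy": {
        "signature": hashlib.sha256(_REMOTE_BYTES).hexdigest(),
        "features": ["a", "b"],
        "target": "target",
    }
}


class SketchDataset:
    """Illustrative only: fetch bytes, verify SHA-256, expose to_data()."""

    def __init__(self, name, signature, features, target, cache=None):
        self.name = name
        self.features = features
        self.target = target
        self._cache = {} if cache is None else cache
        if name not in self._cache:
            # "Download" only when the data is not already cached.
            self._cache[name] = _REMOTE_BYTES
        digest = hashlib.sha256(self._cache[name]).hexdigest()
        if digest != signature:
            raise ValueError(f"checksum mismatch for dataset {name!r}")

    def to_data(self):
        text = self._cache[self.name].decode()
        rows = list(csv.DictReader(io.StringIO(text)))
        X = [[float(row[f]) for f in self.features] for row in rows]
        y = [float(row[self.target]) for row in rows]
        return X, y


def load_sketch(name, return_dataset=False):
    info = MANIFEST[name]               # step 1: manifest lookup
    data = SketchDataset(name, **info)  # step 2: fetch and verify
    if return_dataset:                  # step 3: raw object or (X, y)
        return data
    return data.to_data()


X, y = load_sketch("toy")
```

Calling load_sketch("toy", return_dataset=True) returns the SketchDataset instance instead of the tuple, mirroring the return_dataset flag described above.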

The module exports 11 loader functions covering classification (load_mushroom, load_credit, load_occupancy, load_spam, load_game), regression (load_concrete, load_energy, load_bikeshare), clustering (load_nfl, load_walking), and text analysis (load_hobbies) tasks. All non-corpus loaders share the same (data_home=None, return_dataset=False) signature.
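The grouping in the paragraph above can be written down as a small lookup table. The sketch below is illustrative: LOADERS_BY_TASK and get_loader are hypothetical names, not part of Yellowbrick, and get_loader only works when the yellowbrick package is installed.

```python
import importlib

# Loader names grouped by task, copied from the paragraph above.
LOADERS_BY_TASK = {
    "classification": [
        "load_mushroom", "load_credit", "load_occupancy",
        "load_spam", "load_game",
    ],
    "regression": ["load_concrete", "load_energy", "load_bikeshare"],
    "clustering": ["load_nfl", "load_walking"],
    "text": ["load_hobbies"],
}


def get_loader(name):
    """Hypothetical helper: resolve a loader function by its name."""
    known = {n for names in LOADERS_BY_TASK.values() for n in names}
    if name not in known:
        raise KeyError(f"unknown loader: {name}")
    # Imported lazily so the table itself is usable without yellowbrick.
    module = importlib.import_module("yellowbrick.datasets")
    return getattr(module, name)


total = sum(len(names) for names in LOADERS_BY_TASK.values())
```

Under these assumptions, get_loader("load_concrete")() would be equivalent to calling load_concrete() directly.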

Usage

Import individual loader functions from yellowbrick.datasets when you need sample data for Yellowbrick visualizer examples, testing, or benchmarking. Use the default return_dataset=False to get an (X, y) tuple ready for scikit-learn estimators, or set return_dataset=True to access the full Dataset object with metadata, alternative targets, and content descriptions.

Code Reference

Source Location

  • Repository: yellowbrick
  • File: yellowbrick/datasets/loaders.py
  • Lines: 54-62 (_load_dataset), 155-193 (load_credit), 196-234 (load_occupancy), 237-275 (load_mushroom), 356-394 (load_bikeshare), 479-517 (load_nfl)

Signature

# Internal helper (shared by all loaders)
def _load_dataset(name, data_home=None, return_dataset=False):
    ...

# Public loader functions (all share this signature)
def load_mushroom(data_home=None, return_dataset=False):
    ...

def load_bikeshare(data_home=None, return_dataset=False):
    ...

def load_nfl(data_home=None, return_dataset=False):
    ...

def load_credit(data_home=None, return_dataset=False):
    ...

def load_occupancy(data_home=None, return_dataset=False):
    ...

Import

from yellowbrick.datasets import load_mushroom
from yellowbrick.datasets import load_bikeshare
from yellowbrick.datasets import load_nfl
from yellowbrick.datasets import load_credit
from yellowbrick.datasets import load_occupancy

I/O Contract

Inputs

_load_dataset (internal helper)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| name | str | Yes | The dataset name used as a key into the DATASETS manifest dictionary. |
| data_home | str | No | The path on disk where data is stored. If not passed, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home(). |
| return_dataset | bool | No | If True, return the raw Dataset object. If False (default), return the (X, y) tuple. |

Public loader functions (all identical)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data_home | str | No | The path on disk where data is stored. If not passed, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home(). |
| return_dataset | bool | No | If True, return the raw Dataset object instead of X and y arrays. Default: False. |
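The lookup order described for data_home (explicit argument first, then $YELLOWBRICK_DATA, then a default) can be sketched as follows. This is an assumption-laden illustration, not the body of get_data_home(): resolve_data_home is a hypothetical name, and the default path is a made-up placeholder rather than Yellowbrick's actual default.

```python
import os


def resolve_data_home(data_home=None, default="~/yellowbrick_data"):
    """Illustrative precedence only; `default` is a placeholder path."""
    if data_home is None:
        # Fall back to the environment variable, then to the placeholder.
        data_home = os.environ.get("YELLOWBRICK_DATA", default)
    return os.path.abspath(os.path.expanduser(data_home))
```

An explicit data_home argument always wins in this sketch; the environment variable is consulted only when the argument is omitted.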

Outputs

When return_dataset=False (default)

| Name | Type | Description |
|------|------|-------------|
| X | array-like with shape (n_instances, n_features) | A pandas DataFrame or numpy array describing the instance features. |
| y | array-like with shape (n_instances,) | A pandas Series or numpy array describing the target vector. |

When return_dataset=True

| Name | Type | Description |
|------|------|-------------|
| dataset | Dataset | The Yellowbrick Dataset object providing access to data in multiple formats, metadata, and content descriptions. |

Usage Examples

Basic Usage

from yellowbrick.datasets import load_mushroom

# Load as (X, y) tuple for use with scikit-learn estimators
X, y = load_mushroom()

print(X.shape)  # (8123, 3)
print(y.shape)  # (8123,)

Returning the Dataset Object

from yellowbrick.datasets import load_bikeshare

# Load the full Dataset object for metadata access
dataset = load_bikeshare(return_dataset=True)

# Access data in different formats
X, y = dataset.to_data()
df = dataset.to_dataframe()

Using with a Yellowbrick Visualizer

from yellowbrick.datasets import load_credit
from yellowbrick.classifier import ClassificationReport
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the credit dataset
X, y = load_credit()

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and fit a classification report visualizer
viz = ClassificationReport(RandomForestClassifier())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

Custom Data Home

from yellowbrick.datasets import load_nfl

# Specify a custom cache directory
X, y = load_nfl(data_home="/tmp/yellowbrick_data")

Available Datasets

| Loader Function | Task Type | Instances | Features | Description |
|-----------------|-----------|-----------|----------|-------------|
| load_mushroom | Binary Classification | 8,123 | 3 categorical | Mushroom edibility dataset |
| load_credit | Binary Classification | 30,000 | 23 integer/real | Credit card default prediction |
| load_occupancy | Binary Classification | 20,560 | 5 real-valued | Room occupancy detection (time-series) |
| load_bikeshare | Regression | 17,379 | 12 integer/real | Bike sharing demand prediction |
| load_nfl | Clustering | 494 | 28 mixed | NFL football receivers statistics |
| load_concrete | Regression | 1,030 | 8 real-valued | Concrete compressive strength |
| load_energy | Multi-output Regression | 768 | 8 real-valued | Building energy efficiency |
| load_spam | Binary Classification | 4,600 | 57 integer/real | Email spam detection |
| load_walking | Clustering / Multi-label | 149,332 | Multi-variate time series | Walking activity recognition |
| load_game | Multiclass Classification | 67,557 | 42 categorical | Connect-4 game outcomes |
| load_hobbies | Text Analysis | 448 documents | Text corpus | Hobbies topic classification |
