Implementation:DistrictDataLabs Yellowbrick Dataset Loaders

Knowledge Sources	Yellowbrick, Yellowbrick Docs
Domains	Machine_Learning, Visualization, Data_Science
Last Updated	2026-02-08 00:00 GMT

Overview

A concrete tool for loading bundled sample datasets, provided by the Yellowbrick library.

Description

The yellowbrick.datasets.loaders module provides a collection of named loader functions that return bundled datasets suitable for machine learning visualization tasks. Each public dataset loader (e.g., load_mushroom, load_bikeshare, load_nfl, load_credit, load_occupancy) takes the same two optional arguments and delegates to the shared internal helper _load_dataset().

The _load_dataset() helper performs three steps: (1) looks up the dataset's metadata (URL, checksum, feature names) from a JSON manifest file loaded at module import time; (2) constructs a Dataset object that handles downloading from a remote host if the data is not already cached locally, and verifies the SHA-256 signature; and (3) either returns the raw Dataset object or calls data.to_data() to extract the standard (X, y) tuple.
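The three steps above can be sketched with the standard library alone. This is an illustrative approximation, not the library's code: the DATASETS entry, the ToyDataset class, the in-memory ARCHIVE payload, and the load_dataset_sketch name are all hypothetical stand-ins, and no download actually takes place.

```python
import hashlib

# Stand-in for the archive bytes that would be downloaded or read from cache.
ARCHIVE = b"color,edible\nred,0\nbrown,1\n"

# Hypothetical manifest entry; the real manifest is a JSON file in the package.
DATASETS = {
    "toy": {
        "url": "https://example.com/toy.zip",
        "signature": hashlib.sha256(ARCHIVE).hexdigest(),
    }
}


class ToyDataset:
    """Illustrative stand-in for the library's Dataset class."""

    def __init__(self, name, url, signature, data_home=None):
        self.name = name
        self.url = url
        self.signature = signature
        self.data_home = data_home
        # Step 2: "fetch" the archive and verify its SHA-256 digest.
        payload = ARCHIVE  # a real loader would download or read the cache here
        if hashlib.sha256(payload).hexdigest() != signature:
            raise ValueError(f"signature mismatch for dataset {name!r}")
        # Skip the header row and split each remaining line into columns.
        lines = payload.decode().splitlines()[1:]
        self._rows = [line.split(",") for line in lines]

    def to_data(self):
        # Features are all columns but the last; the target is the last column.
        X = [row[:-1] for row in self._rows]
        y = [int(row[-1]) for row in self._rows]
        return X, y


def load_dataset_sketch(name, data_home=None, return_dataset=False):
    # Step 1: look up the dataset's metadata in the manifest.
    info = DATASETS[name]
    # Step 2: build the dataset object (fetching and verification happen inside).
    data = ToyDataset(name, data_home=data_home, **info)
    # Step 3: return either the object itself or the (X, y) tuple.
    if return_dataset:
        return data
    return data.to_data()


X, y = load_dataset_sketch("toy")
```

Keeping the lookup, verification, and return-shape decision in one helper is what lets each public loader stay a thin wrapper that only supplies a dataset name.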

The module exports 11 loader functions covering classification (load_mushroom, load_credit, load_occupancy, load_spam, load_game), regression (load_concrete, load_energy, load_bikeshare), clustering (load_nfl, load_walking), and text analysis (load_hobbies) tasks. All loaders except the corpus loader load_hobbies share the same (data_home=None, return_dataset=False) signature.

Usage

Import individual loader functions from yellowbrick.datasets when you need sample data for Yellowbrick visualizer examples, testing, or benchmarking. Use the default return_dataset=False to get an (X, y) tuple ready for scikit-learn estimators, or set return_dataset=True to access the full Dataset object with metadata, alternative targets, and content descriptions.

Code Reference

Source Location

Repository: yellowbrick
File: yellowbrick/datasets/loaders.py
Lines: 54-62 (_load_dataset), 155-193 (load_credit), 196-234 (load_occupancy), 237-275 (load_mushroom), 356-394 (load_bikeshare), 479-517 (load_nfl)

Signature

# Internal helper (shared by all loaders)
def _load_dataset(name, data_home=None, return_dataset=False):
    ...

# Public loader functions (all share this signature)
def load_mushroom(data_home=None, return_dataset=False):
    ...

def load_bikeshare(data_home=None, return_dataset=False):
    ...

def load_nfl(data_home=None, return_dataset=False):
    ...

def load_credit(data_home=None, return_dataset=False):
    ...

def load_occupancy(data_home=None, return_dataset=False):
    ...

Import

from yellowbrick.datasets import load_mushroom
from yellowbrick.datasets import load_bikeshare
from yellowbrick.datasets import load_nfl
from yellowbrick.datasets import load_credit
from yellowbrick.datasets import load_occupancy

I/O Contract

Inputs

_load_dataset (internal helper)

Name	Type	Required	Description
name	str	Yes	The dataset name used as a key into the DATASETS manifest dictionary.
data_home	str	No	The path on disk where data is stored. If not passed, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home().
return_dataset	bool	No	If True, return the raw Dataset object. If False (default), return the (X, y) tuple.

Public loader functions (all identical)

Name	Type	Required	Description
data_home	str	No	The path on disk where data is stored. If not passed, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home().
return_dataset	bool	No	If True, return the raw Dataset object instead of X and y arrays. Default: False.
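The data_home fallback order described in the tables above (explicit argument, then $YELLOWBRICK_DATA, then a default) might look roughly like the following. The resolve_data_home name and the "~/yellowbrick_data" default are assumptions made for illustration; the actual default is whatever get_data_home() returns.

```python
import os


def resolve_data_home(data_home=None, default="~/yellowbrick_data"):
    """Hypothetical helper: explicit argument, then env var, then default."""
    if data_home is None:
        # Fall back to the environment variable, then to the assumed default.
        data_home = os.environ.get("YELLOWBRICK_DATA", default)
    # Expand "~" and normalize to an absolute path.
    return os.path.abspath(os.path.expanduser(data_home))
```

With this ordering, setting YELLOWBRICK_DATA once in the shell redirects every loader call that does not pass data_home explicitly.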

Outputs

When return_dataset=False (default)

Name	Type	Description
X	array-like with shape (n_instances, n_features)	The instance features, returned as a pandas DataFrame when pandas is installed and as a NumPy array otherwise.
y	array-like with shape (n_instances,)	The target vector, returned as a pandas Series when pandas is installed and as a NumPy array otherwise.

When return_dataset=True

Name	Type	Description
dataset	Dataset	The Yellowbrick Dataset object providing access to data in multiple formats, metadata, and content descriptions.

Usage Examples

Basic Usage

from yellowbrick.datasets import load_mushroom

# Load as (X, y) tuple for use with scikit-learn estimators
X, y = load_mushroom()

print(X.shape)  # (8123, 3)
print(y.shape)  # (8123,)

Returning the Dataset Object

from yellowbrick.datasets import load_bikeshare

# Load the full Dataset object for metadata access
dataset = load_bikeshare(return_dataset=True)

# Access data in different formats
X, y = dataset.to_data()
df = dataset.to_dataframe()

Using with a Yellowbrick Visualizer

from yellowbrick.datasets import load_credit
from yellowbrick.classifier import ClassificationReport
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the credit dataset
X, y = load_credit()

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and fit a classification report visualizer
viz = ClassificationReport(RandomForestClassifier())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()

Custom Data Home

from yellowbrick.datasets import load_nfl

# Specify a custom cache directory
X, y = load_nfl(data_home="/tmp/yellowbrick_data")

Available Datasets

Loader Function	Task Type	Instances	Features	Description
load_mushroom	Binary Classification	8,123	3 categorical	Mushroom edibility dataset
load_credit	Binary Classification	30,000	23 integer/real	Credit card default prediction
load_occupancy	Binary Classification	20,560	5 real-valued	Room occupancy detection (time-series)
load_bikeshare	Regression	17,379	12 integer/real	Bike sharing demand prediction
load_nfl	Clustering	494	28 mixed	NFL football receivers statistics
load_concrete	Regression	1,030	8 real-valued	Concrete compressive strength
load_energy	Multi-output Regression	768	8 real-valued	Building energy efficiency
load_spam	Binary Classification	4,600	57 integer/real	Email spam detection
load_walking	Clustering / Multi-label	149,332	Multi-variate time series	Walking activity recognition
load_game	Multiclass Classification	67,557	42 categorical	Connect-4 game outcomes
load_hobbies	Text Analysis	448 documents	Text corpus	Hobbies topic classification

Related Pages

Implements Principle

Principle:DistrictDataLabs_Yellowbrick_Dataset_Loading

Requires Environment
