Implementation:DistrictDataLabs Yellowbrick Dataset Loaders
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Visualization, Data_Science |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool for loading bundled sample datasets, provided by the Yellowbrick library.
Description
The yellowbrick.datasets.loaders module provides a collection of named loader functions that return bundled datasets suitable for machine learning visualization tasks. Each non-corpus public loader function (e.g., load_mushroom, load_bikeshare, load_nfl, load_credit, load_occupancy) has an identical signature and delegates to the shared internal helper _load_dataset().
The _load_dataset() helper performs three steps: (1) looks up the dataset's metadata (URL, checksum, feature names) from a JSON manifest file loaded at module import time; (2) constructs a Dataset object that handles downloading from a remote host if the data is not already cached locally, and verifies the SHA-256 signature; and (3) either returns the raw Dataset object or calls data.to_data() to extract the standard (X, y) tuple.
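The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Yellowbrick's actual code: `MANIFEST`, `FakeDataset`, and `load_dataset_sketch` are hypothetical stand-ins, the URL is a placeholder, and the download is replaced by an in-memory payload so the example runs offline.

```python
import hashlib

# Hypothetical payload standing in for a downloaded data file.
PAYLOAD = b"color,shape,target\nred,round,edible\n"

# Hypothetical manifest; the real one is read from a JSON file at import time.
MANIFEST = {
    "toy": {
        "url": "https://example.com/toy.zip",  # placeholder, never fetched
        "signature": hashlib.sha256(PAYLOAD).hexdigest(),
    }
}


class FakeDataset:
    """Stand-in for the Dataset object; checks a SHA-256 signature."""

    def __init__(self, name, url, signature, payload):
        self.name = name
        self.url = url
        # Verify the payload against the expected signature.
        actual = hashlib.sha256(payload).hexdigest()
        if actual != signature:
            raise ValueError(f"signature mismatch for {name!r}")
        # Skip the header row and split each record into fields.
        lines = payload.decode().splitlines()[1:]
        self._rows = [line.split(",") for line in lines]

    def to_data(self):
        """Return (X, y): all columns but the last, and the last column."""
        X = [row[:-1] for row in self._rows]
        y = [row[-1] for row in self._rows]
        return X, y


def load_dataset_sketch(name, payload=PAYLOAD, return_dataset=False):
    # Step 1: look up the dataset's metadata by name.
    info = MANIFEST[name]
    # Step 2: construct the dataset object (download replaced by `payload`).
    data = FakeDataset(name, info["url"], info["signature"], payload)
    # Step 3: return the object itself or the (X, y) tuple.
    return data if return_dataset else data.to_data()


X, y = load_dataset_sketch("toy")
print(X, y)
```

Passing a payload whose hash does not match the manifest entry raises `ValueError` in this sketch, which is the point of the signature check: corrupted or tampered data is rejected before it is handed to the caller.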
The module exports 11 loader functions covering classification (load_mushroom, load_credit, load_occupancy, load_spam, load_game), regression (load_concrete, load_energy, load_bikeshare), clustering (load_nfl, load_walking), and text analysis (load_hobbies) tasks. All non-corpus loaders share the same (data_home=None, return_dataset=False) signature.
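As a quick reference, the grouping above can be written down as a plain dictionary. The names `LOADERS_BY_TASK` and `loaders_for` are hypothetical helpers for illustration and are not part of Yellowbrick; only the loader names themselves come from the description.

```python
# Loader names grouped by task type, as listed in the description above.
LOADERS_BY_TASK = {
    "classification": [
        "load_mushroom", "load_credit", "load_occupancy",
        "load_spam", "load_game",
    ],
    "regression": ["load_concrete", "load_energy", "load_bikeshare"],
    "clustering": ["load_nfl", "load_walking"],
    "text analysis": ["load_hobbies"],
}


def loaders_for(task):
    """Return a copy of the loader names for a task, or [] if unknown."""
    return list(LOADERS_BY_TASK.get(task, []))


# The four groups together account for all 11 exported loaders.
total = sum(len(names) for names in LOADERS_BY_TASK.values())
print(total)  # 11
print(loaders_for("clustering"))  # ['load_nfl', 'load_walking']
```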
Usage
Import individual loader functions from yellowbrick.datasets when you need sample data for Yellowbrick visualizer examples, testing, or benchmarking. Use the default return_dataset=False to get an (X, y) tuple ready for scikit-learn estimators, or set return_dataset=True to access the full Dataset object with metadata, alternative targets, and content descriptions.
Code Reference
Source Location
- Repository: yellowbrick
- File: yellowbrick/datasets/loaders.py
- Lines: 54-62 (_load_dataset), 155-193 (load_credit), 196-234 (load_occupancy), 237-275 (load_mushroom), 356-394 (load_bikeshare), 479-517 (load_nfl)
Signature
# Internal helper (shared by all loaders)
def _load_dataset(name, data_home=None, return_dataset=False):
    ...

# Public loader functions (all share this signature)
def load_mushroom(data_home=None, return_dataset=False):
    ...

def load_bikeshare(data_home=None, return_dataset=False):
    ...

def load_nfl(data_home=None, return_dataset=False):
    ...

def load_credit(data_home=None, return_dataset=False):
    ...

def load_occupancy(data_home=None, return_dataset=False):
    ...
Import
from yellowbrick.datasets import load_mushroom
from yellowbrick.datasets import load_bikeshare
from yellowbrick.datasets import load_nfl
from yellowbrick.datasets import load_credit
from yellowbrick.datasets import load_occupancy
I/O Contract
Inputs
_load_dataset (internal helper)
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | The dataset name used as a key into the DATASETS manifest dictionary. |
| data_home | str | No | The path on disk where data is stored. If not passed, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home(). |
| return_dataset | bool | No | If True, return the raw Dataset object. If False (default), return the (X, y) tuple. |
Public loader functions (all identical)
| Name | Type | Required | Description |
|---|---|---|---|
| data_home | str | No | The path on disk where data is stored. If not passed, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home(). |
| return_dataset | bool | No | If True, return the raw Dataset object instead of X and y arrays. Default: False. |
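The lookup order described for data_home (explicit argument, then $YELLOWBRICK_DATA, then a default) can be sketched as follows. `resolve_data_home` is a hypothetical helper written for this page, and its `default` value is a placeholder; the real default comes from get_data_home() and may differ.

```python
import os


def resolve_data_home(data_home=None, default="~/yellowbrick_data"):
    """Hypothetical sketch of the data_home lookup order.

    Order: explicit argument, then $YELLOWBRICK_DATA, then `default`
    (a placeholder, not Yellowbrick's actual default directory).
    """
    if data_home is None:
        data_home = os.environ.get("YELLOWBRICK_DATA", default)
    # Normalize to an absolute path with any "~" expanded.
    return os.path.abspath(os.path.expanduser(data_home))


# An explicit argument always wins over the environment variable.
print(resolve_data_home("/tmp/yellowbrick_data"))
```

Setting the environment variable once is convenient when many scripts share one cache directory, while the explicit argument suits one-off overrides such as tests.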
Outputs
When return_dataset=False (default)
| Name | Type | Description |
|---|---|---|
| X | array-like with shape (n_instances, n_features) | A pandas DataFrame or numpy array describing the instance features. |
| y | array-like with shape (n_instances,) | A pandas Series or numpy array describing the target vector. |
When return_dataset=True
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The Yellowbrick Dataset object providing access to data in multiple formats, metadata, and content descriptions. |
Usage Examples
Basic Usage
from yellowbrick.datasets import load_mushroom
# Load as (X, y) tuple for use with scikit-learn estimators
X, y = load_mushroom()
print(X.shape) # (8123, 3)
print(y.shape) # (8123,)
Returning the Dataset Object
from yellowbrick.datasets import load_bikeshare
# Load the full Dataset object for metadata access
dataset = load_bikeshare(return_dataset=True)
# Access data in different formats
X, y = dataset.to_data()
df = dataset.to_dataframe()
Using with a Yellowbrick Visualizer
from yellowbrick.datasets import load_credit
from yellowbrick.classifier import ClassificationReport
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load the credit dataset
X, y = load_credit()
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and fit a classification report visualizer
viz = ClassificationReport(RandomForestClassifier())
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
Custom Data Home
from yellowbrick.datasets import load_nfl
# Specify a custom cache directory
X, y = load_nfl(data_home="/tmp/yellowbrick_data")
Available Datasets
| Loader Function | Task Type | Instances | Features | Description |
|---|---|---|---|---|
| load_mushroom | Binary Classification | 8,123 | 3 categorical | Mushroom edibility dataset |
| load_credit | Binary Classification | 30,000 | 23 integer/real | Credit card default prediction |
| load_occupancy | Binary Classification | 20,560 | 5 real-valued | Room occupancy detection (time-series) |
| load_bikeshare | Regression | 17,379 | 12 integer/real | Bike sharing demand prediction |
| load_nfl | Clustering | 494 | 28 mixed | NFL football receivers statistics |
| load_concrete | Regression | 1,030 | 8 real-valued | Concrete compressive strength |
| load_energy | Multi-output Regression | 768 | 8 real-valued | Building energy efficiency |
| load_spam | Binary Classification | 4,600 | 57 integer/real | Email spam detection |
| load_walking | Clustering / Multi-label | 149,332 | Multi-variate time series | Walking activity recognition |
| load_game | Multiclass Classification | 67,557 | 42 categorical | Connect-4 game outcomes |
| load_hobbies | Text Analysis | 448 documents | Text corpus | Hobbies topic classification |