Implementation:DistrictDataLabs Yellowbrick Dataset Containers

Knowledge Sources	DistrictDataLabs_Yellowbrick
Domains	Data_Science, Datasets
Last Updated	2026-02-08 05:00 GMT

Overview

Concrete container classes for managing Yellowbrick's bundled datasets and text corpora, supporting numpy and pandas output formats.

Description

The datasets.base module provides three classes: BaseDataset handles downloading and verifying datasets from the hosted store, Dataset adds loading into numpy arrays or pandas DataFrames, and Corpus handles text document collections with label-based organization. Each download is checked against the dataset's expected SHA256 signature.
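The signature check described above can be illustrated with a minimal sketch using only the standard library. This is not Yellowbrick's own code: file_sha256 and verify_signature are hypothetical helper names, and the chunked read is an assumption about how a large archive would reasonably be hashed.

```python
import hashlib


def file_sha256(path, blocksize=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are never fully in memory
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_signature(path, expected):
    """Return True if the file matches `expected`, else raise ValueError."""
    actual = file_sha256(path)
    if actual != expected:
        raise ValueError(
            f"signature mismatch: expected {expected}, got {actual}"
        )
    return True
```

A caller would run verify_signature on the downloaded archive before extracting it, so a corrupted or tampered file is rejected early.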

Usage

Import these classes when building custom dataset loading logic or when working directly with Yellowbrick's dataset infrastructure. Most users will instead call the higher-level loader functions (e.g., load_mushroom), which construct Dataset or Corpus objects internally.

Code Reference

Source Location

Repository: DistrictDataLabs_Yellowbrick
File: yellowbrick/datasets/base.py
Lines: 1-334

Signature

class BaseDataset:
    def __init__(self, name, url=None, signature=None, data_home=None):
        """Base functionality for Dataset and Corpus objects."""
    def download(self, replace=False): ...
    def contents(self): ...
    def README(self): ...
    def meta(self): ...
    def citation(self): ...

class Dataset(BaseDataset):
    def to_data(self): ...
    def to_numpy(self): ...
    def to_pandas(self): ...
    def to_dataframe(self): ...

class Corpus(BaseDataset):
    @property
    def root(self): ...
    @property
    def labels(self): ...
    @property
    def files(self): ...
    @property
    def data(self): ...
    @property
    def target(self): ...

Import

from yellowbrick.datasets.base import Dataset, Corpus

I/O Contract

Inputs (Dataset.to_data)

Name	Type	Required	Description
(none)	—	—	Uses data on disk from download

Outputs (Dataset.to_data)

Name	Type	Description
X	DataFrame or ndarray	Feature matrix
y	Series or ndarray	Target vector
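Because to_data can return either pandas or numpy objects, it presumably chooses between to_pandas and to_numpy depending on whether pandas is installed. The sketch below illustrates that dispatch under this assumption. The names load_preferred, pandas_available, and use_pandas are hypothetical and are not part of Yellowbrick.

```python
import importlib.util


def pandas_available() -> bool:
    """Report whether pandas can be imported in this environment."""
    return importlib.util.find_spec("pandas") is not None


def load_preferred(dataset, use_pandas=None):
    """Call to_pandas() when pandas is usable, otherwise to_numpy().

    `dataset` is any object exposing to_pandas() and to_numpy().
    `use_pandas` is a hypothetical override that forces the choice;
    when left as None the choice follows pandas_available().
    """
    if use_pandas is None:
        use_pandas = pandas_available()
    return dataset.to_pandas() if use_pandas else dataset.to_numpy()
```

Either branch returns an (X, y) pair, so calling code can unpack the result the same way regardless of which format was chosen.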

Outputs (Corpus)

Name	Type	Description
data	list of str	Document contents
target	list of str	Document labels
labels	list of str	Unique label names
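The relationship between labels, files, data, and target is easiest to see on a small directory tree. The sketch below assumes a layout with one subdirectory per label and one text file per document. read_corpus is a hypothetical helper written for illustration, not Yellowbrick's implementation, and the sorting is added here only to make the order deterministic.

```python
import os
import tempfile


def read_corpus(root):
    """Return (labels, files, data, target) for a corpus rooted at `root`."""
    # Every subdirectory of the root is treated as one label
    labels = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    files = [
        os.path.join(root, label, name)
        for label in labels
        for name in sorted(os.listdir(os.path.join(root, label)))
    ]
    data = []
    for path in files:
        with open(path, "r", encoding="utf-8") as f:
            data.append(f.read())
    # Each document's label is the name of the directory that contains it
    target = [os.path.basename(os.path.dirname(path)) for path in files]
    return labels, files, data, target


# Build a tiny throwaway corpus to exercise the helper
samples = [("cooking", "Whisk the eggs."), ("sports", "The match ended 2-1.")]
with tempfile.TemporaryDirectory() as root:
    for label, text in samples:
        os.makedirs(os.path.join(root, label))
        doc = os.path.join(root, label, "doc1.txt")
        with open(doc, "w", encoding="utf-8") as f:
            f.write(text)
    labels, files, data, target = read_corpus(root)
    print(labels)
    print(target)
```

Note that data and target are parallel lists: the i-th entry of target is the label for the i-th document in data.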

Usage Examples

from yellowbrick.datasets.base import Dataset

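# Wrap a bundled dataset by name. With url and signature left as None,
# this assumes the "mushroom" files are already in the data home
# (for example, after an earlier call to load_mushroom()).
ds = Dataset("mushroom")

# to_pandas() returns the feature DataFrame and the target Series
X, y = ds.to_pandas()
print(X.head())
print(y.value_counts())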

from yellowbrick.datasets.base import Corpus

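# Wrap a text corpus by name. As above, this assumes the "hobbies"
# files are already present in the data home.
corpus = Corpus("hobbies")
print(corpus.labels)     # unique label names
print(len(corpus.data))  # number of documents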

Related Pages

Environment:DistrictDataLabs_Yellowbrick_Python_Scikit_Learn_Environment
