Implementation:DistrictDataLabs Yellowbrick Dataset Containers

From Leeroopedia


Knowledge Sources
Domains: Data_Science, Datasets
Last Updated: 2026-02-08 05:00 GMT

Overview

Concrete container classes for managing Yellowbrick's bundled datasets and text corpora, supporting numpy and pandas output formats.

Description

The yellowbrick.datasets.base module provides three classes. BaseDataset handles downloading datasets from the hosted store and verifying them. Dataset adds loading into NumPy arrays or pandas DataFrames. Corpus handles collections of text documents organized by label. Each download is checked against the dataset's expected SHA256 signature.
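The signature check described above can be illustrated with a minimal sketch using only the standard library. The helper names sha256sum and verify_archive, the chunk size, and the demo bytes are assumptions made for illustration; they are not taken from Yellowbrick's source, which may structure the check differently.

```python
import hashlib
import os
import tempfile


def sha256sum(path, blocksize=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_signature):
    """Hypothetical helper: raise ValueError when the digest does not match."""
    actual = sha256sum(path)
    if actual != expected_signature:
        raise ValueError(
            f"signature mismatch: expected {expected_signature}, got {actual}"
        )
    return True


# Demo on a temporary file standing in for a downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example archive bytes")
    demo_path = tmp.name

demo_signature = hashlib.sha256(b"example archive bytes").hexdigest()
verified = verify_archive(demo_path, demo_signature)

# A wrong signature should be rejected.
try:
    verify_archive(demo_path, "0" * 64)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

os.remove(demo_path)
```

Comparing against a signature stored alongside the dataset's URL lets a loader detect a corrupted or replaced download before any data is read from it.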

Usage

Import these classes when building custom dataset loading logic or when working directly with Yellowbrick's dataset infrastructure. Most users will instead call the higher-level loader functions (e.g., load_mushroom), which construct Dataset or Corpus objects internally.

Code Reference

Source Location

Signature

class BaseDataset:
    def __init__(self, name, url=None, signature=None, data_home=None):
        """Base functionality for Dataset and Corpus objects."""
    def download(self, replace=False): ...
    def contents(self): ...
    def README(self): ...
    def meta(self): ...
    def citation(self): ...

class Dataset(BaseDataset):
    def to_data(self): ...
    def to_numpy(self): ...
    def to_pandas(self): ...
    def to_dataframe(self): ...

class Corpus(BaseDataset):
    @property
    def root(self): ...
    @property
    def labels(self): ...
    @property
    def files(self): ...
    @property
    def data(self): ...
    @property
    def target(self): ...

Import

from yellowbrick.datasets.base import Dataset, Corpus

I/O Contract

Inputs (Dataset.to_data)

Name | Type | Required | Description
(none) | - | - | Uses data on disk from download

Outputs (Dataset.to_data)

Name | Type | Description
X | DataFrame or ndarray | Feature matrix
y | Series or ndarray | Target vector
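The two possible output types suggest a pandas-first, NumPy-fallback dispatch. The sketch below shows one way such a dispatch could look. It is an assumption about the pattern, not a copy of Dataset.to_data. The functions split_features_target and to_data are hypothetical stand-ins, and the rows are made-up toy records rather than the real mushroom data.

```python
import importlib.util


def split_features_target(rows, target):
    """Split a list of record dicts into feature rows X and target values y."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


def to_data(rows, target):
    """Hypothetical dispatch: pandas objects if pandas is installed, else arrays."""
    X, y = split_features_target(rows, target)
    if importlib.util.find_spec("pandas") is not None:
        import pandas as pd

        return pd.DataFrame(X), pd.Series(y, name=target)
    import numpy as np

    return np.array([list(row.values()) for row in X]), np.array(y)


# Toy records invented for this example.
rows = [
    {"shape": "convex", "color": "yellow", "target": "edible"},
    {"shape": "bell", "color": "white", "target": "edible"},
    {"shape": "flat", "color": "red", "target": "poisonous"},
]
X_rows, y_values = split_features_target(rows, "target")
```

Calling to_data(rows, "target") returns a DataFrame and Series when pandas is importable and two ndarrays otherwise, which matches the type alternatives listed in the table.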

Outputs (Corpus)

Name | Type | Description
data | list of str | Document contents
target | list of str | Document labels
labels | list of str | Unique label names
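To make the relationship between data, target, and labels concrete, here is a minimal sketch that assumes a root/<label>/<document> directory layout. The layout, the read_corpus function, and the sample documents are assumptions for illustration and are not taken from the Corpus implementation.

```python
import os
import tempfile


def read_corpus(root):
    """Return (labels, files, data, target) for a root/<label>/<document> layout."""
    # Each subdirectory of root is treated as one label.
    labels = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    files, data, target = [], [], []
    for label in labels:
        label_dir = os.path.join(root, label)
        for fname in sorted(os.listdir(label_dir)):
            path = os.path.join(label_dir, fname)
            with open(path, encoding="utf-8") as f:
                data.append(f.read())
            files.append(path)
            target.append(label)  # one label per document, aligned with data
    return labels, files, data, target


# Build a tiny throwaway corpus with made-up documents.
samples = [("books", "A novel."), ("cooking", "A recipe."), ("books", "A poem.")]
with tempfile.TemporaryDirectory() as root:
    for label, text in samples:
        label_dir = os.path.join(root, label)
        os.makedirs(label_dir, exist_ok=True)
        index = len(os.listdir(label_dir))
        with open(os.path.join(label_dir, f"doc{index}.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    labels, files, data, target = read_corpus(root)
```

In this sketch data and target have the same length and order, while labels holds each distinct label once.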

Usage Examples

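from yellowbrick.datasets.base import Dataset

# Load a dataset by name (no url is passed, so the data is assumed
# to already be on disk)
ds = Dataset("mushroom")

# Split into a feature DataFrame and a target Series
X, y = ds.to_pandas()
print(X.head())
print(y.value_counts())

from yellowbrick.datasets.base import Corpus

# Load a text corpus by name (same on-disk assumption as above)
corpus = Corpus("hobbies")
print(corpus.labels)     # unique label names
print(len(corpus.data))  # number of documents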
