Implementation:DistrictDataLabs Yellowbrick Dataset Containers
| Knowledge Sources | |
|---|---|
| Domains | Data_Science, Datasets |
| Last Updated | 2026-02-08 05:00 GMT |
Overview
Concrete container classes for managing Yellowbrick's bundled datasets and text corpora, supporting numpy and pandas output formats.
Description
The datasets.base module provides three classes. BaseDataset handles downloading datasets from the hosted store and verifying them. Dataset adds loading into numpy arrays or pandas DataFrames. Corpus handles collections of text documents organized by label. Each download is checked against the dataset's SHA256 signature.
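The signature check described above can be sketched with the standard library alone. This is a minimal illustration, not Yellowbrick's own code: the helper names `sha256sum` and `verify_archive` are hypothetical, and a temporary file stands in for a downloaded archive.

```python
import hashlib
import os
import tempfile


def sha256sum(path, blocksize=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_signature):
    """Raise ValueError if the file does not match the expected signature."""
    actual = sha256sum(path)
    if actual != expected_signature:
        raise ValueError(
            "signature mismatch: expected {} but got {}".format(
                expected_signature, actual
            )
        )
    return True


# A throwaway file stands in for a downloaded dataset archive.
payload = b"example archive bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    archive_path = tmp.name

expected = hashlib.sha256(payload).hexdigest()
verified = verify_archive(archive_path, expected)

# A wrong signature should be rejected.
try:
    verify_archive(archive_path, "0" * 64)
    tampered_rejected = False
except ValueError:
    tampered_rejected = True

os.remove(archive_path)
```

Reading in fixed-size chunks keeps memory use flat for large archives, which is why the sketch avoids loading the whole file at once.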
Usage
Import these classes when building custom dataset loading logic or when working directly with Yellowbrick's dataset infrastructure. Most users will instead call the higher-level loader functions (e.g., load_mushroom), which construct Dataset or Corpus objects internally.
Code Reference
Source Location
- Repository: DistrictDataLabs_Yellowbrick
- File: yellowbrick/datasets/base.py
- Lines: 1-334
Signature
class BaseDataset:
    def __init__(self, name, url=None, signature=None, data_home=None):
        """Base functionality for Dataset and Corpus objects."""
    def download(self, replace=False): ...
    def contents(self): ...
    def README(self): ...
    def meta(self): ...
    def citation(self): ...

class Dataset(BaseDataset):
    def to_data(self): ...
    def to_numpy(self): ...
    def to_pandas(self): ...
    def to_dataframe(self): ...

class Corpus(BaseDataset):
    @property
    def root(self): ...
    @property
    def labels(self): ...
    @property
    def files(self): ...
    @property
    def data(self): ...
    @property
    def target(self): ...
Import
from yellowbrick.datasets.base import Dataset, Corpus
I/O Contract
Inputs (Dataset.to_data)
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | Uses data on disk from download |
Outputs (Dataset.to_data)
| Name | Type | Description |
|---|---|---|
| X | DataFrame or ndarray | Feature matrix |
| y | Series or ndarray | Target vector |
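To make the X and y outputs concrete, the sketch below splits tabular rows into a feature matrix and a target vector using plain Python lists. It is only an illustration under stated assumptions: the real Dataset class returns pandas or numpy objects, the function name `split_features_target` is hypothetical, the sample rows are made up, and the `features` and `target` metadata keys are assumed rather than taken from the text above.

```python
import csv
import io

# Made-up rows and metadata standing in for a dataset's files on disk.
CSV_TEXT = """shape,surface,color,target
convex,smooth,yellow,edible
bell,smooth,white,edible
convex,scaly,white,poisonous
"""
META = {"features": ["shape", "surface", "color"], "target": "target"}


def split_features_target(csv_text, meta):
    """Split tabular rows into a feature matrix X and a target vector y."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # One list of feature values per row, in the order named by the metadata.
    X = [[row[name] for name in meta["features"]] for row in rows]
    # One target value per row.
    y = [row[meta["target"]] for row in rows]
    return X, y


X, y = split_features_target(CSV_TEXT, META)
```

X and y always have the same number of rows, which is the property downstream estimators and visualizers rely on.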
Outputs (Corpus)
| Name | Type | Description |
|---|---|---|
| data | list of str | Document contents |
| target | list of str | Document labels |
| labels | list of str | Unique label names |
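The relationship between data, target, and labels can be sketched for a corpus stored as one subdirectory per label. This is a simplified stand-in, not the Corpus implementation: the directory layout is an assumption, and `build_demo_corpus` and `read_corpus` are hypothetical helpers operating on a temporary directory.

```python
import os
import tempfile


def build_demo_corpus(root):
    """Write a tiny corpus with one subdirectory per label."""
    docs = {
        "cooking": {"a.txt": "Simmer the stock.", "b.txt": "Knead the dough."},
        "sports": {"c.txt": "The match went to overtime."},
    }
    for label, files in docs.items():
        os.makedirs(os.path.join(root, label))
        for name, text in files.items():
            path = os.path.join(root, label, name)
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)


def read_corpus(root):
    """Return (labels, data, target) for a label-per-directory corpus."""
    labels = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    data, target = [], []
    for label in labels:
        folder = os.path.join(root, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                data.append(f.read())
            # Each document's label is the name of its parent directory.
            target.append(label)
    return labels, data, target


with tempfile.TemporaryDirectory() as root:
    build_demo_corpus(root)
    labels, data, target = read_corpus(root)
```

Here data and target are parallel lists with one entry per document, while labels holds each distinct label once.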
Usage Examples
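from yellowbrick.datasets.base import Dataset

# Load a dataset by name. With url and signature left unset, this
# presumably relies on the dataset already being present under data_home.
ds = Dataset("mushroom")
X, y = ds.to_pandas()
print(X.head())           # first rows of the feature matrix
print(y.value_counts())   # class distribution of the target

from yellowbrick.datasets.base import Corpus

# Load a text corpus by name (same assumption about local availability)
corpus = Corpus("hobbies")
print(corpus.labels)      # unique label names
print(len(corpus.data))   # number of documents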