Implementation:Scikit learn Scikit learn DatasetsModule
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Loading |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
A concrete tool, provided by scikit-learn, for loading popular datasets and generating synthetic data.
Description
The sklearn.datasets module is the central namespace that aggregates scikit-learn's dataset loading and generation utilities. It re-exports three families of functions: load_* functions for bundled toy datasets (iris, digits, wine, breast cancer, diabetes, linnerud); fetch_* functions for remote datasets that are downloaded on first use and cached locally (California housing, covtype, Olivetti faces, 20 newsgroups, species distributions, OpenML); and make_* functions for synthetic datasets (blobs, classification, regression, moons, circles, Swiss roll).
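As a rough illustration of these families, the sketch below calls one bundled loader and three generators. The sample counts and random_state values are arbitrary choices for the example, and no remote fetcher is called so that it runs offline.

```python
from sklearn.datasets import load_wine, make_blobs, make_moons, make_swiss_roll

# Bundled toy dataset: shipped with scikit-learn, no network access needed.
wine = load_wine()

# Synthetic generators return arrays directly; random_state makes them reproducible.
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)
X_roll, t_roll = make_swiss_roll(n_samples=500, random_state=0)

print(wine.data.shape)    # (178, 13)
print(X_blobs.shape)      # (300, 2)
print(X_moons.shape)      # (200, 2)
print(X_roll.shape)       # (500, 3)

# Remote fetchers such as fetch_california_housing download on first use and
# cache the files locally; they are intentionally not called in this sketch.
```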
Usage
Use this module when you need a standard dataset for benchmarking, testing, or prototyping machine learning models. Bundled and remotely fetched datasets share the same calling convention: loaders and fetchers return a Bunch by default, and most accept return_X_y=True to return plain arrays instead.
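A minimal prototyping sketch, assuming a scaled logistic-regression baseline is an acceptable stand-in for "a model". The estimator, split ratio, and random_state are illustrative choices, not something this page prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a bundled binary-classification dataset as plain arrays.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a simple linear baseline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```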
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/datasets/__init__.py
Signature
# Selected module-level imports, shown as absolute imports
# (the file itself uses relative imports, e.g. `from ._base import ...`):
from sklearn.datasets._base import load_iris, load_digits, load_wine, load_breast_cancer
from sklearn.datasets._california_housing import fetch_california_housing
from sklearn.datasets._covtype import fetch_covtype
from sklearn.datasets._samples_generator import make_classification, make_regression, make_blobs
Import
from sklearn import datasets
from sklearn.datasets import load_iris, fetch_california_housing, make_classification
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
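| return_X_y | bool | No | If True, returns a (data, target) tuple instead of a Bunch object. Default is False. Accepted by the load_* functions and most fetch_* functions. |
| as_frame | bool | No | If True, returns the data as a pandas DataFrame (requires pandas). Accepted only by loaders and fetchers that support tabular output. |
| data_home | str or None | No | Custom directory for caching downloaded datasets. Used by fetch_* functions only; if None, the default is ~/scikit_learn_data. |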
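
The parameters in the table can be exercised as in the sketch below. It uses load_diabetes purely as an example loader, guards the as_frame call in case pandas is not installed, and uses get_data_home with a temporary directory to show how a data_home value is resolved without downloading anything.

```python
import tempfile

from sklearn.datasets import get_data_home, load_diabetes

# Default call: a Bunch object carrying the arrays plus metadata.
bunch = load_diabetes()

# return_X_y=True: only the (data, target) pair.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)  # (442, 10) (442,)

# as_frame=True needs pandas; guarded so the sketch still runs without it.
try:
    frame_bunch = load_diabetes(as_frame=True)
    print(type(frame_bunch.data).__name__)
except ImportError:
    frame_bunch = None

# data_home only matters for fetch_* functions. get_data_home resolves (and
# creates) the cache directory that a fetcher would use for a given value.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_dir = get_data_home(data_home=tmp_dir)
    print(cache_dir)
```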
Outputs
| Name | Type | Description |
|---|---|---|
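| dataset | Bunch | Dictionary-like object whose keys (data, target, feature_names, and other metadata) are also accessible as attributes |
| (X, y) | tuple | Feature matrix and target array, returned by loaders and fetchers when return_X_y=True and by the synthetic generators |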
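
To make the Bunch output concrete, here is a small sketch using load_iris. It shows dictionary-style and attribute-style access to the same fields, and how the metadata can be used to decode integer targets.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict whose keys are also available as attributes.
print(sorted(iris.keys()))
print(iris["data"].shape == iris.data.shape)

# Metadata travels with the arrays.
print(iris.feature_names)
print(list(iris.target_names))

# Map the first few integer targets back to class names.
first_labels = [str(iris.target_names[i]) for i in iris.target[:3]]
print(first_labels)  # ['setosa', 'setosa', 'setosa']
```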
Usage Examples
Basic Usage
from sklearn.datasets import load_iris, make_classification
# Load a bundled dataset
iris = load_iris()
print(iris.data.shape) # (150, 4)
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
print(X.shape) # (1000, 20)