Principle: Scikit-learn Dataset Loading
| Field | Value |
|---|---|
| sources | Paper: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science; scikit-learn documentation: https://scikit-learn.org/stable/datasets.html |
| domains | Data_Science, Machine_Learning |
| last_updated | 2026-02-08 15:00 GMT |
Overview
A foundational step that provides structured data for model training and evaluation.
Description
Dataset loading is the process of acquiring and organizing raw data into structured arrays suitable for machine learning algorithms. In scikit-learn, this is accomplished through a unified interface that returns feature matrices (commonly referred to as X) and target vectors (commonly referred to as y) from a variety of sources.
Scikit-learn provides three categories of dataset loading utilities:
- Bundled (toy) datasets -- Small, classic datasets shipped with the library itself (e.g., Iris, Digits, Wine). These require no network access and are available immediately after installation.
- Remote (real-world) datasets -- Larger datasets fetched on demand from external repositories via functions such as `fetch_openml` or `fetch_20newsgroups`. These are cached locally after the first download.
- Synthetic (generated) datasets -- Algorithmically generated datasets created by functions such as `make_classification`, `make_regression`, and `make_blobs`. These are useful for controlled experiments where the ground truth is known by construction.
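As a minimal sketch of the synthetic category, `make_classification` builds a labeled dataset with properties fixed by its parameters (the specific parameter values below are illustrative, not prescribed by the text):

```python
# Sketch: generating a synthetic classification dataset with known ground truth.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200,    # number of observations (rows)
    n_features=10,    # number of attributes (columns)
    n_informative=5,  # features that actually carry class signal
    n_classes=2,
    random_state=0,   # fixed seed for reproducibility
)
print(X.shape)  # (200, 10)
print(y.shape)  # (200,)
```

Because the generator controls the ground truth by construction, the same seed always reproduces the same dataset, which is what makes these functions useful for controlled experiments.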
Regardless of the source, the loading functions return data in a consistent format: either a `Bunch` object (a dictionary-like container with attribute access) or, when `return_X_y=True`, a simple tuple of `(X, y)` arrays.
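The two return formats can be seen side by side with a bundled dataset such as Iris:

```python
from sklearn.datasets import load_iris

# Bunch form: a dictionary-like container with attribute access
iris = load_iris()
print(iris.data.shape)    # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Tuple form: skip the Bunch and receive (X, y) directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)   # (150, 4) (150,)
```

The tuple form is convenient when only the arrays are needed; the Bunch form additionally carries metadata such as feature names and the dataset description.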
Usage
Use dataset loading when:
- Prototyping and benchmarking -- Bundled datasets offer a quick, reproducible starting point for testing algorithms without worrying about data acquisition or cleaning.
- Teaching and tutorials -- Standard datasets such as Iris and Digits are widely recognized in the literature, making them ideal for educational contexts.
- Controlled experiments -- Synthetic generators allow precise control over dimensionality, class separation, noise level, and other properties.
Use custom data pipelines (e.g., pandas.read_csv, database connectors) when working with domain-specific, production, or proprietary data that is not available through scikit-learn's built-in loaders.
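For the custom-pipeline case, the same X / y split can be produced from a CSV loaded with pandas. The snippet below uses a small in-memory CSV (the column names and values are hypothetical); in practice `pd.read_csv` would be given a file path:

```python
import io
import pandas as pd

# Hypothetical CSV with a "label" target column; a real pipeline
# would pass a file path or database query result instead.
csv_text = """sepal_len,sepal_wid,label
5.1,3.5,0
6.2,2.9,1
5.9,3.0,1
"""

df = pd.read_csv(io.StringIO(csv_text))
X = df.drop(columns="label").to_numpy()  # feature matrix, shape (n_samples, n_features)
y = df["label"].to_numpy()               # target vector, shape (n_samples,)
print(X.shape, y.shape)  # (3, 2) (3,)
```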
Theoretical Basis
The data structures returned by scikit-learn's loading functions follow a well-established convention:
- Bunch container -- The `sklearn.utils.Bunch` class extends Python's `dict` to support attribute-style access. A typical Bunch returned by a loader contains the keys `data`, `target`, `feature_names`, `target_names`, `DESCR`, and `frame` (when `as_frame=True`).
- X / y split convention -- The feature matrix X is a two-dimensional array of shape `(n_samples, n_features)` where each row is an observation and each column is a measured attribute. The target vector y is a one-dimensional array of shape `(n_samples,)` holding the labels or values to be predicted. This separation mirrors the mathematical notation used in supervised learning: given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$, the goal is to learn a mapping $f: x \mapsto y$.
- Data types -- Feature arrays are typically NumPy `float64` ndarrays (or pandas DataFrames when `as_frame=True`). Target arrays are `int64` for classification tasks and `float64` for regression tasks. Sparse matrices (`scipy.sparse.csr_matrix`) are used for high-dimensional, sparse datasets such as text corpora.
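The shape and dtype conventions above can be checked directly on two bundled datasets, one classification (Iris) and one regression (Diabetes):

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris

X_clf, y_clf = load_iris(return_X_y=True)      # classification task
X_reg, y_reg = load_diabetes(return_X_y=True)  # regression task

# Features are 2-D float arrays; targets are 1-D
print(X_clf.ndim, y_clf.ndim)  # 2 1

# Classification targets are integer-typed, regression targets float-typed
print(np.issubdtype(y_clf.dtype, np.integer))   # True
print(np.issubdtype(y_reg.dtype, np.floating))  # True
```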