Implementation:Scikit learn Scikit learn FetchOpenml

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Data Loading, Machine Learning
Last Updated	2026-02-08 15:00 GMT

Overview

Concrete tool, provided by scikit-learn, for fetching datasets from the OpenML repository.

Description

This module implements the fetch_openml function, which downloads and loads datasets from the OpenML platform (openml.org). It supports fetching by dataset name or ID, caching downloaded data locally, and returning results as NumPy arrays or pandas DataFrames. The module calls the OpenML REST API to search for datasets and to retrieve their metadata, feature descriptions, and data qualities. For robustness, it retries failed loads after clearing the affected cache entry, and it parses gzip-compressed ARFF files.
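The retry-with-cache-clearing idea mentioned above can be illustrated with a minimal, standalone sketch. This is not scikit-learn's internal code: the function name load_with_clean_cache_retry and the download callable are hypothetical, and the sketch assumes the cache is a single gzip-compressed file.

```python
import gzip
import os
import tempfile
import zlib


def load_with_clean_cache_retry(cache_path, download):
    """Read a gzip-compressed cache file; if it is unreadable, delete it and retry once.

    `download` is a hypothetical callable that returns the raw, uncompressed bytes.
    """

    def read_cached():
        if not os.path.exists(cache_path):
            # Cache miss: fetch the payload and store it gzip-compressed.
            with gzip.open(cache_path, "wb") as fh:
                fh.write(download())
        with gzip.open(cache_path, "rb") as fh:
            return fh.read()

    try:
        return read_cached()
    except (OSError, EOFError, zlib.error):
        # The cached file is corrupt or truncated: clear it and try one more time.
        if os.path.exists(cache_path):
            os.remove(cache_path)
        return read_cached()
```

A second failure is allowed to propagate, so a persistent problem (for example, an unreachable server) is still reported to the caller rather than hidden.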

Usage

Use this function to access any of the thousands of datasets available on the OpenML platform for machine learning experimentation, benchmarking, and research.

Code Reference

Source Location

Repository: scikit-learn
File: sklearn/datasets/_openml.py

Signature

def fetch_openml(
    name: Optional[str] = None,
    *,
    version: Union[str, int] = "active",
    data_id: Optional[int] = None,
    data_home: Optional[Union[str, os.PathLike]] = None,
    target_column: Optional[Union[str, List]] = "default-target",
    cache: bool = True,
    return_X_y: bool = False,
    as_frame: Union[str, bool] = "auto",
    n_retries: int = 3,
    delay: float = 1.0,
    parser: str = "auto",
)

Import

from sklearn.datasets import fetch_openml

I/O Contract

Inputs

Name	Type	Required	Description
name	str or None	No	Name of the dataset on OpenML
version	str or int	No	Version of the dataset: integer or 'active' (default: 'active')
data_id	int or None	No	OpenML dataset ID (alternative to name)
data_home	str or PathLike or None	No	Custom directory for data caching
target_column	str, list, or None	No	Column name(s) for target variable (default: 'default-target')
cache	bool	No	Whether to cache downloaded data (default: True)
return_X_y	bool	No	If True, return (data, target) tuple (default: False)
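as_frame	str or bool	No	If True, return data as a pandas DataFrame; 'auto' behaves like True unless the dataset is stored in sparse ARFF format (default: 'auto')
n_retries	int	No	Number of download retry attempts (default: 3)
delay	float	No	Number of seconds to wait between retries (default: 1.0)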
parser	str	No	Parser to use: 'auto', 'pandas', or 'liac-arff' (default: 'auto')
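Because name and data_id are alternative ways to select a dataset, it can help to check the combination locally before any network request is made. The sketch below is one way to do that; build_fetch_kwargs and fetch are hypothetical helpers, not part of scikit-learn, and fetch_openml performs its own validation regardless.

```python
def build_fetch_kwargs(name=None, data_id=None, version="active", **extra):
    """Hypothetical helper: assemble keyword arguments for fetch_openml."""
    if (name is None) == (data_id is None):
        raise ValueError("Pass exactly one of `name` or `data_id`.")
    if data_id is not None:
        # Assumption: a data_id already pins one specific dataset version,
        # so an explicit version is rejected here.
        if version != "active":
            raise ValueError("Pass either `data_id` or `version`, not both.")
        kwargs = {"data_id": data_id}
    else:
        kwargs = {"name": name, "version": version}
    # Remaining options (cache, return_X_y, parser, ...) pass through unchanged.
    kwargs.update(extra)
    return kwargs


def fetch(**selection):
    """Validate the selection locally, then download (requires network access)."""
    # Imported lazily so that build_fetch_kwargs works without scikit-learn.
    from sklearn.datasets import fetch_openml

    return fetch_openml(**build_fetch_kwargs(**selection))
```

For example, fetch(name="iris", version=1, return_X_y=True) would forward name, version, and return_X_y to fetch_openml after the local check passes.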

Outputs

Name	Type	Description
data	Bunch	Dictionary-like with data, target, feature_names, target_names, DESCR, details, url
(X, y)	tuple	Returned when return_X_y=True; feature data and target data
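Since the return type depends on return_X_y, downstream code sometimes needs to accept both forms. The following is a small sketch of one way to normalize them; as_X_y is a hypothetical helper, and SimpleNamespace is used only as an offline stand-in for a Bunch, relying on the fact that a Bunch exposes its fields as attributes.

```python
from types import SimpleNamespace


def as_X_y(result):
    """Return (X, y) from either output form listed in the table above."""
    if isinstance(result, tuple):
        # return_X_y=True: the result is already a (data, target) pair.
        X, y = result
        return X, y
    # Bunch-like object: features and target are available as attributes.
    return result.data, result.target


# Offline stand-in for a Bunch (illustrative values, not fetched from OpenML).
bunch_like = SimpleNamespace(data=[[5.1, 3.5]], target=["Iris-setosa"])
X, y = as_X_y(bunch_like)
```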

Usage Examples

Basic Usage

from sklearn.datasets import fetch_openml

# Fetch the iris dataset from OpenML
iris = fetch_openml(name='iris', version=1, as_frame=True)
print("Shape:", iris.data.shape)
print("Features:", iris.feature_names)
print("Target:", iris.target.unique())

Related Pages

Principle:Scikit_learn_Scikit_learn_Dataset_Loading
