Implementation:Scikit learn Scikit learn FetchOpenml

From Leeroopedia


Knowledge Sources
Domains Data Loading, Machine Learning
Last Updated 2026-02-08 15:00 GMT

Overview

Concrete tool, provided by scikit-learn, for fetching datasets from the OpenML repository.

Description

This module implements the fetch_openml function, which downloads and loads datasets from the OpenML platform (openml.org). A dataset is selected either by name (optionally with a version) or by numeric data_id, but not both. Downloaded data is cached locally by default, and results are returned as NumPy arrays or pandas DataFrames. Internally, the module uses the OpenML REST API to search for datasets and to retrieve their metadata, features, and data qualities. For robustness, it includes retry logic with cache clearing, and it supports parsing gzip-compressed ARFF files.
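The retry behaviour mentioned above can be pictured with a small, generic sketch. This is not scikit-learn's internal code: call_with_retries, on_retry, and sleep are hypothetical names introduced for illustration, and the choice to retry on OSError is an assumption. The n_retries and delay arguments mirror the parameters of the same name in the fetch_openml signature.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    n_retries: int = 3,
    delay: float = 1.0,
    on_retry: Optional[Callable[[], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying up to n_retries times after a failure.

    on_retry is a hypothetical hook (for example, deleting a possibly
    corrupted cache file) that runs before each new attempt.
    """
    attempts_left = n_retries
    while True:
        try:
            return func()
        except OSError:
            # Network errors such as urllib.error.URLError derive from OSError.
            if attempts_left == 0:
                raise  # out of retries: surface the last error to the caller
            attempts_left -= 1
            if on_retry is not None:
                on_retry()
            sleep(delay)
```

With n_retries=3, the wrapped call runs at most four times: one initial attempt plus three retries. Passing a custom sleep makes the helper easy to exercise without real waiting.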

Usage

Use this function to access any of the thousands of datasets available on the OpenML platform for machine learning experimentation, benchmarking, and research.

Code Reference

Source Location

Signature

def fetch_openml(
    name: Optional[str] = None,
    *,
    version: Union[str, int] = "active",
    data_id: Optional[int] = None,
    data_home: Optional[Union[str, os.PathLike]] = None,
    target_column: Optional[Union[str, List]] = "default-target",
    cache: bool = True,
    return_X_y: bool = False,
    as_frame: Union[str, bool] = "auto",
    n_retries: int = 3,
    delay: float = 1.0,
    parser: str = "auto",
)

Import

from sklearn.datasets import fetch_openml

I/O Contract

Inputs

Name | Type | Required | Description
name | str or None | No | Name of the dataset on OpenML; give either name or data_id, not both
version | str or int | No | Version of the dataset: an integer or 'active' (default: 'active')
data_id | int or None | No | OpenML dataset ID (alternative to name)
data_home | str, PathLike, or None | No | Custom directory for data caching
target_column | str, list, or None | No | Column name(s) to use as the target variable (default: 'default-target')
cache | bool | No | Whether to cache downloaded data (default: True)
return_X_y | bool | No | If True, return a (data, target) tuple instead of a Bunch (default: False)
as_frame | str or bool | No | Whether to return data as a pandas DataFrame: True, False, or 'auto' (default: 'auto')
n_retries | int | No | Number of download retry attempts (default: 3)
delay | float | No | Number of seconds to wait between retries (default: 1.0)
parser | str | No | Parser to use: 'auto', 'pandas', or 'liac-arff' (default: 'auto')
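As an illustration of how the inputs above combine, the following sketch requests NumPy arrays rather than a DataFrame. The values are illustrative only: data_id 61 is assumed to refer to the iris dataset on OpenML, and load_iris_arrays and FETCH_KWARGS are hypothetical names. Calling the helper needs scikit-learn and, on first use, network access.

```python
# Keyword arguments drawn from the inputs table; values are illustrative.
FETCH_KWARGS = {
    "data_id": 61,          # assumed to be the OpenML ID of the iris dataset
    "as_frame": False,      # request arrays instead of a pandas DataFrame
    "return_X_y": True,     # return an (X, y) tuple rather than a Bunch
    "cache": True,          # reuse the local copy on later calls
    "n_retries": 3,         # retry attempts on download failure
    "delay": 1.0,           # seconds between retries
    "parser": "liac-arff",  # one of the parser options listed in the table
}


def load_iris_arrays():
    """Download (or read from the local cache) the dataset and return (X, y)."""
    # Imported lazily so this snippet can be loaded without scikit-learn present.
    from sklearn.datasets import fetch_openml

    return fetch_openml(**FETCH_KWARGS)
```

Because data_id identifies the dataset on its own, name and version are left out here.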

Outputs

Name | Type | Description
data | Bunch | Dictionary-like object with data, target, feature_names, target_names, DESCR, details, url
(X, y) | tuple | Returned when return_X_y=True; feature data and target data
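Code that may receive either output shape can normalize it with a small helper. The sketch below is hypothetical (unpack_features_and_target is not part of scikit-learn). It relies on the Bunch being dictionary-like, so the "data" and "target" keys listed above can be read with ordinary key access.

```python
from typing import Any, Mapping, Tuple, Union


def unpack_features_and_target(
    result: Union[Mapping[str, Any], Tuple[Any, Any]],
) -> Tuple[Any, Any]:
    """Return (X, y) from either output shape of fetch_openml."""
    if isinstance(result, tuple):
        # return_X_y=True: the result is already an (X, y) pair.
        X, y = result
        return X, y
    # Default case: a dictionary-like Bunch with "data" and "target" entries.
    return result["data"], result["target"]
```

A plain dict stands in for the Bunch when trying the helper without downloading anything, for example unpack_features_and_target({"data": [[5.1, 3.5]], "target": ["setosa"]}).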

Usage Examples

Basic Usage

from sklearn.datasets import fetch_openml

# Fetch the iris dataset from OpenML
iris = fetch_openml(name='iris', version=1, as_frame=True)
print("Shape:", iris.data.shape)
print("Features:", iris.feature_names)
print("Target:", iris.target.unique())
