Implementation:Scikit learn Scikit learn FetchOpenml
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, Machine Learning |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool for fetching datasets from the OpenML repository provided by scikit-learn.
Description
This module implements the fetch_openml function, which downloads and loads datasets from the OpenML platform (openml.org). It supports fetching by dataset name or ID, caching downloaded data locally, and returning results as NumPy arrays or pandas DataFrames. The module calls the OpenML REST API to search for datasets and to retrieve their metadata, feature descriptions, and data qualities. For robustness, it retries failed downloads and clears the cache before retrying. It also parses gzip-compressed ARFF files.
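To make the gzip-compressed ARFF handling mentioned above concrete, here is a minimal sketch that reads the attribute declarations from the header of a compressed ARFF payload using only the standard library. The helper name `read_arff_attributes` is hypothetical and this is not scikit-learn's parser; it assumes UTF-8 text and attribute names that contain no spaces.

```python
import gzip
import io


def read_arff_attributes(raw_bytes):
    """Return (name, type) pairs from the header of a gzip-compressed ARFF payload.

    Illustrative only: real ARFF files may quote names that contain spaces,
    which this sketch does not handle.
    """
    attributes = []
    # gzip.open accepts a file object, so the payload can stay in memory.
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line.lower().startswith("@attribute"):
                # "@attribute <name> <type>" -> keep the name and the type.
                _, name, kind = line.split(None, 2)
                attributes.append((name.strip("'\""), kind))
            elif line.lower().startswith("@data"):
                break  # the header ends where the data section starts
    return attributes
```

In practice fetch_openml performs this parsing internally, so a helper like this is only useful for understanding the file layout.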
Usage
Use this function to access any of the thousands of datasets available on the OpenML platform for machine learning experimentation, benchmarking, and research.
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/datasets/_openml.py
Signature
def fetch_openml(
    name: Optional[str] = None,
    *,
    version: Union[str, int] = "active",
    data_id: Optional[int] = None,
    data_home: Optional[Union[str, os.PathLike]] = None,
    target_column: Optional[Union[str, List]] = "default-target",
    cache: bool = True,
    return_X_y: bool = False,
    as_frame: Union[str, bool] = "auto",
    n_retries: int = 3,
    delay: float = 1.0,
    parser: str = "auto",
)
Import
from sklearn.datasets import fetch_openml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str or None | No | Name of the dataset on OpenML |
| version | str or int | No | Version of the dataset: integer or 'active' (default: 'active') |
| data_id | int or None | No | OpenML dataset ID (alternative to name) |
| data_home | str or PathLike or None | No | Custom directory for data caching |
| target_column | str, list, or None | No | Column name(s) for target variable (default: 'default-target') |
| cache | bool | No | Whether to cache downloaded data (default: True) |
| return_X_y | bool | No | If True, return (data, target) tuple (default: False) |
| as_frame | str or bool | No | If True, return data as a pandas DataFrame; 'auto' behaves like True unless the dataset is stored in sparse format (default: 'auto') |
| n_retries | int | No | Number of retries after a failed download (default: 3) |
| delay | float | No | Seconds to wait between retries (default: 1.0) |
| parser | str | No | Parser to use: 'auto', 'pandas', or 'liac-arff' (default: 'auto') |
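The n_retries and delay parameters in the signature describe a retry loop. The sketch below shows the general shape of such a loop, with an optional cleanup hook standing in for cache clearing. The names `call_with_retries`, `on_failure`, and `sleep` are hypothetical, and this is not scikit-learn's internal code; it assumes failures surface as `OSError` (the base class of `urllib.error.URLError` and `TimeoutError`).

```python
import time


def call_with_retries(func, n_retries=3, delay=1.0, on_failure=None, sleep=time.sleep):
    """Call func(); on OSError, wait `delay` seconds and retry up to `n_retries` times.

    Hypothetical helper for illustration. `on_failure` can remove a possibly
    corrupt cached file before the next attempt.
    """
    retries_left = n_retries
    while True:
        try:
            return func()
        except OSError:
            if retries_left == 0:
                raise  # retries exhausted: surface the last error
            retries_left -= 1
            if on_failure is not None:
                on_failure()
            sleep(delay)
```

With this shape, `n_retries=3` allows up to four attempts in total: the initial call plus three retries.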
Outputs
| Name | Type | Description |
|---|---|---|
| data | Bunch | Dictionary-like object with data, target, frame, categories, feature_names, target_names, DESCR, details, url |
| (X, y) | tuple | Returned when return_X_y=True; feature data and target data |
Usage Examples
Basic Usage
from sklearn.datasets import fetch_openml
# Fetch the iris dataset from OpenML
iris = fetch_openml(name='iris', version=1, as_frame=True)
print("Shape:", iris.data.shape)
print("Features:", iris.feature_names)
print("Target:", iris.target.unique())
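The example above selects the dataset by name; data_id is the alternative selector, and the two are not meant to be combined. The following sketch wraps that choice in a small helper that requires exactly one of them and forwards any remaining keyword arguments. The wrapper `load_openml` and its `fetch` parameter are hypothetical conveniences for illustration, not part of scikit-learn.

```python
def load_openml(name=None, data_id=None, fetch=None, **kwargs):
    """Hypothetical wrapper: require exactly one of `name` or `data_id`.

    `fetch` defaults to sklearn.datasets.fetch_openml; passing another
    callable makes the helper usable offline, for example in tests.
    """
    if (name is None) == (data_id is None):
        raise ValueError("Pass exactly one of `name` or `data_id`.")
    if fetch is None:
        # Imported lazily so the argument check works without scikit-learn.
        from sklearn.datasets import fetch_openml as fetch
    if data_id is not None:
        return fetch(data_id=data_id, **kwargs)
    return fetch(name=name, **kwargs)


# Example call (requires network access and scikit-learn):
# X, y = load_openml(name="iris", version=1, return_X_y=True, as_frame=True)
```

With return_X_y=True the call returns the (data, target) tuple directly instead of a Bunch.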