Implementation:Scikit learn Scikit learn ArffParser

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Data Loading, Machine Learning
Last Updated	2026-02-08 15:00 GMT

Overview

Concrete tool, provided by scikit-learn, for parsing ARFF (Attribute-Relation File Format) data files.

Description

This module implements ARFF parsers used internally by scikit-learn's OpenML data fetcher. It provides functions to parse both dense and sparse ARFF data representations, handling column selection, data type conversions, and sparse-to-dense transformations. The module supports loading ARFF data from gzip-compressed files and can return results as either NumPy arrays or pandas DataFrames.
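To make the gzip-compressed input concrete, here is a minimal sketch that writes a tiny dense ARFF file with the standard library and reads it back. It does not use scikit-learn's parser: `read_dense_arff_gz`, the toy file contents, and the attribute names are all assumptions made for illustration, and the helper handles only unquoted names and values.

```python
import gzip
import os
import tempfile

# A tiny dense ARFF document, made up for this illustration.
ARFF_TEXT = """\
@relation toy
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute species {setosa,versicolor}
@data
5.1,3.5,setosa
7.0,3.2,versicolor
"""


def read_dense_arff_gz(path):
    """Hypothetical helper: return (attribute_names, rows) from a gzipped dense ARFF file.

    Handles only unquoted names and values; this is not scikit-learn's parser.
    """
    names, rows, in_data = [], [], False
    # "rt" decompresses and decodes to text in one step.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("%"):  # skip blank lines and comments
                continue
            if in_data:
                rows.append(line.split(","))
            elif line.lower().startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1])
            elif line.lower().startswith("@data"):
                in_data = True
    return names, rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.arff.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(ARFF_TEXT)
    names, rows = read_dense_arff_gz(path)

print(names)
print(rows)
```

Values come back as strings here; converting them to numeric or categorical types is the kind of data type conversion the real module takes care of.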

Usage

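Use this module when loading datasets from OpenML or other sources that provide data in ARFF format. Because it is an internal module (note the leading underscore in _arff_parser), it is normally reached through the fetch_openml interface rather than imported directly.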

Code Reference

Source Location

Repository: scikit-learn
File: sklearn/datasets/_arff_parser.py

Signature

def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType

def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray

def load_arff_from_gzip_file(...)
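The column selection and re-indexing described for `_split_sparse_columns` can be illustrated with a standalone sketch. The function below, `split_sparse_columns`, is a hypothetical stand-in written against the documented contract (a tuple of values, row indices, and column indices, plus a list of columns to keep); it is not a copy of the library code, and the sample data is invented.

```python
def split_sparse_columns(arff_data, include_columns):
    """Hypothetical stand-in: keep only the requested columns and renumber them."""
    values, rows, cols = arff_data
    # Map each kept original column index to its position in the output.
    new_index = {col: pos for pos, col in enumerate(include_columns)}
    kept = ([], [], [])
    for val, row, col in zip(values, rows, cols):
        if col in new_index:
            kept[0].append(val)
            kept[1].append(row)
            kept[2].append(new_index[col])
    return kept


# Four non-zero entries spread over columns 0, 1 and 3 (made-up data).
sparse = ([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 2], [0, 3, 1, 3])
filtered = split_sparse_columns(sparse, include_columns=[1, 3])
print(filtered)  # ([2.0, 3.0, 4.0], [0, 1, 2], [1, 0, 1])
```

Original column 1 becomes column 0 and original column 3 becomes column 1, so downstream code can treat the result as a compact matrix that contains only the selected columns.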

Import

from sklearn.datasets._arff_parser import load_arff_from_gzip_file

I/O Contract

Inputs

Name	Type	Required	Description
arff_data	tuple	Yes	Sparse ARFF data as a tuple of three parallel lists: (values, row_indices, col_indices)
include_columns	list	Yes	List of column indices to include in the output

Outputs

Name	Type	Description
arff_data_new	tuple	Filtered sparse ARFF data with re-indexed columns
array	np.ndarray	Dense array converted from sparse ARFF data
X	np.ndarray or DataFrame	Feature matrix loaded from ARFF file
y	np.ndarray or Series	Target variable loaded from ARFF file
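As a rough illustration of how the `array` output relates to the sparse input, the sketch below densifies a (values, row_indices, col_indices) tuple with NumPy. `sparse_to_dense` is a hypothetical name, the sample data is invented, and filling unset cells with zeros is an assumption of this sketch (sparse ARFF omits zero-valued entries); the library function may differ in its details.

```python
import numpy as np


def sparse_to_dense(arff_data, include_columns):
    """Hypothetical stand-in: build a dense float array from sparse triplets."""
    values, rows, cols = arff_data
    # Map each kept original column index to its position in the output.
    new_index = {col: pos for pos, col in enumerate(include_columns)}
    n_rows = max(rows) + 1  # row indices are zero-based
    # Assumption: cells with no stored entry are zero.
    dense = np.zeros((n_rows, len(include_columns)), dtype=np.float64)
    for val, row, col in zip(values, rows, cols):
        if col in new_index:
            dense[row, new_index[col]] = val
    return dense


# Made-up data: four entries in a 3-row matrix, columns 0, 1 and 3 populated.
sparse = ([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 2], [0, 3, 1, 3])
dense = sparse_to_dense(sparse, include_columns=[1, 3])
print(dense.shape)  # (3, 2)
print(dense)
```

The number of rows is inferred from the largest row index, so trailing rows with no stored entries would not appear in the result.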

Usage Examples

Basic Usage

# Typically used internally by fetch_openml
from sklearn.datasets import fetch_openml

# This internally uses the ARFF parser
data = fetch_openml(name='iris', version=1, as_frame=True)
print(data.data.head())
print(data.target.head())

Related Pages

Principle:Scikit_learn_Scikit_learn_Dataset_Loading
