Implementation:Scikit learn Scikit learn ArffParser

From Leeroopedia


Knowledge Sources
Domains Data Loading, Machine Learning
Last Updated 2026-02-08 15:00 GMT

Overview

A concrete tool, provided by scikit-learn, for parsing ARFF (Attribute-Relation File Format) data files.

Description

This module implements ARFF parsers used internally by scikit-learn's OpenML data fetcher. It provides functions to parse both dense and sparse ARFF data representations, handling column selection, data type conversions, and sparse-to-dense transformations. The module supports loading ARFF data from gzip-compressed files and can return results as either NumPy arrays or pandas DataFrames.
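To make the gzip-handling step concrete, here is a minimal, dependency-free sketch that compresses a tiny ARFF document in memory and reads the attribute names back from its header. It is an illustration only: `read_arff_attribute_names` and the sample data are hypothetical, the parsing ignores quoted names and sparse rows, and nothing here is taken from scikit-learn's implementation.

```python
import gzip
import io

# Hypothetical sample ARFF document used only for this illustration.
SAMPLE_ARFF = """\
@relation toy
@attribute sepal_length numeric
@attribute species {setosa,versicolor}
@data
5.1,setosa
7.0,versicolor
"""


def read_arff_attribute_names(gzip_file):
    """Return attribute names from the header of a gzip-compressed ARFF stream.

    Simplified: assumes unquoted attribute names without spaces.
    """
    names = []
    with gzip.open(gzip_file, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            stripped = line.strip()
            lowered = stripped.lower()
            if lowered.startswith("@data"):
                break  # the header ends where the data section begins
            if lowered.startswith("@attribute"):
                # "@attribute <name> <type>" -> take the second token
                names.append(stripped.split()[1])
    return names


# Compress the sample into memory so the example needs no file on disk.
buffer = io.BytesIO()
with gzip.open(buffer, mode="wb") as gz:
    gz.write(SAMPLE_ARFF.encode("utf-8"))
buffer.seek(0)

attribute_names = read_arff_attribute_names(buffer)
print(attribute_names)
```

In practice the header also carries the attribute types, which is presumably what drives the data type conversions mentioned above; this sketch discards them to stay short.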

Usage

Use this module when loading datasets from OpenML or other sources that provide data in ARFF format, particularly through the fetch_openml interface.

Code Reference

Source Location

Signature

def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType

def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray

def load_arff_from_gzip_file(...)

Import

from sklearn.datasets._arff_parser import load_arff_from_gzip_file

I/O Contract

Inputs

Name | Type | Required | Description
arff_data | tuple | Yes | Sparse ARFF data as tuple of (values, row_indices, col_indices)
include_columns | list | Yes | List of column indices to include in the output
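The input contract above can be sketched as a small column filter over the (values, row_indices, col_indices) triple. This is a minimal illustration written from the documented contract, not a copy of scikit-learn's code; the name `select_sparse_columns` and the `SparseTriples` alias are hypothetical.

```python
from typing import List, Tuple

# Hypothetical alias for illustration: (values, row_indices, col_indices).
SparseTriples = Tuple[List, List, List]


def select_sparse_columns(
    arff_data: SparseTriples, include_columns: List[int]
) -> SparseTriples:
    """Keep entries whose column is in include_columns and renumber them from 0."""
    # Map each original column index to its position in the output.
    new_index = {col: pos for pos, col in enumerate(include_columns)}
    values, rows, cols = [], [], []
    for val, row_idx, col_idx in zip(*arff_data):
        if col_idx in new_index:
            values.append(val)
            rows.append(row_idx)
            cols.append(new_index[col_idx])
    return values, rows, cols


# Three rows, four original columns; keep only columns 1 and 3.
triples = ([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 2], [0, 1, 3, 1])
filtered = select_sparse_columns(triples, include_columns=[1, 3])
print(filtered)  # ([2.0, 3.0, 4.0], [0, 1, 2], [0, 1, 0])
```

Row indices are left unchanged, so a row whose entries were all dropped simply has no entries in the filtered triple.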

Outputs

Name | Type | Description
arff_data_new | tuple | Filtered sparse ARFF data with re-indexed columns
array | np.ndarray | Dense array converted from sparse ARFF data
X | np.ndarray or DataFrame | Feature matrix loaded from ARFF file
y | np.ndarray or Series | Target variable loaded from ARFF file
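As a rough picture of the sparse-to-dense output, the sketch below expands a (values, row_indices, col_indices) triple into a dense row-major table. It deliberately uses nested Python lists instead of `np.ndarray` so that it runs without dependencies; the function name `sparse_triples_to_rows` and the `fill` parameter for missing cells are assumptions made for this example, not part of scikit-learn's API.

```python
import math


def sparse_triples_to_rows(arff_data, include_columns, fill=math.nan):
    """Expand (values, row_indices, col_indices) into a dense row-major table."""
    values, row_indices, col_indices = arff_data
    # Map each original column index to its position in the dense output.
    new_index = {col: pos for pos, col in enumerate(include_columns)}
    # Assumes the highest row index present marks the last row of the dataset.
    n_rows = max(row_indices) + 1
    dense = [[fill] * len(include_columns) for _ in range(n_rows)]
    for val, row_idx, col_idx in zip(values, row_indices, col_indices):
        if col_idx in new_index:
            dense[row_idx][new_index[col_idx]] = float(val)
    return dense


triples = ([1.0, 2.0, 3.0], [0, 1, 1], [0, 0, 2])
dense = sparse_triples_to_rows(triples, include_columns=[0, 2], fill=0.0)
print(dense)  # [[1.0, 0.0], [2.0, 3.0]]
```

Passing `fill=0.0` reflects the usual sparse ARFF convention that omitted entries are zero; a NumPy-based version would allocate a 2-D array of the same shape instead of nested lists.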

Usage Examples

Basic Usage

# Typically used internally by fetch_openml
from sklearn.datasets import fetch_openml

# This internally uses the ARFF parser
data = fetch_openml(name='iris', version=1, as_frame=True)
print(data.data.head())
print(data.target.head())
