Implementation:Scikit learn Scikit learn ArffParser
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, Machine Learning |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
A concrete tool, provided by scikit-learn, for parsing ARFF (Attribute-Relation File Format) data files.
Description
This module implements ARFF parsers used internally by scikit-learn's OpenML data fetcher. It provides functions to parse both dense and sparse ARFF data representations, handling column selection, data type conversions, and sparse-to-dense transformations. The module supports loading ARFF data from gzip-compressed files and can return results as either NumPy arrays or pandas DataFrames.
Usage
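Use this module when loading datasets from OpenML, or from other sources that provide data in ARFF format. The leading underscore in `_arff_parser` marks the module as private, so it is normally reached through the public fetch_openml interface rather than imported directly.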
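To make the file format concrete, the following minimal sketch writes a tiny, made-up dense ARFF document to a gzip-compressed file and reads it back with the standard library only. The helper `read_gzip_arff` and the toy data are hypothetical and are not part of scikit-learn. The sketch ignores quoting, missing values, and sparse rows, which the real parser has to handle.

```python
import gzip
import os
import tempfile

# A tiny, made-up dense ARFF document (not a real OpenML dataset).
ARFF_TEXT = """\
@RELATION toy
@ATTRIBUTE width NUMERIC
@ATTRIBUTE height NUMERIC
@ATTRIBUTE label {a,b}
@DATA
1.0,2.0,a
3.5,4.5,b
"""


def read_gzip_arff(path):
    """Return (attribute_names, rows) from a gzip-compressed dense ARFF file.

    Hypothetical helper for illustration only: it does not handle quoting,
    missing values, or the sparse `{index value, ...}` row syntax.
    """
    names, rows, in_data = [], [], False
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("%"):
                continue  # skip blank lines and ARFF comments
            if in_data:
                rows.append(line.split(","))
            elif line.upper().startswith("@ATTRIBUTE"):
                names.append(line.split()[1])
            elif line.upper().startswith("@DATA"):
                in_data = True
    return names, rows


# Round trip: compress the toy document, then parse it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.arff.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(ARFF_TEXT)
    names, rows = read_gzip_arff(path)

print(names)
print(rows)
```

All values come back as strings here. Converting them to numeric or categorical types is a separate step that this sketch leaves out.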
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/datasets/_arff_parser.py
Signature
def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType

def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray

def load_arff_from_gzip_file(...)
Import
from sklearn.datasets._arff_parser import load_arff_from_gzip_file
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| arff_data | tuple | Yes | Sparse ARFF data as tuple of (values, row_indices, col_indices) |
| include_columns | list | Yes | List of column indices to include in the output |
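The two inputs above can be illustrated with a small sketch of column filtering on the sparse triple. It is written from the I/O contract, not copied from scikit-learn, and the function name `split_sparse_columns` (without the leading underscore) is a stand-in. It assumes that kept columns are renumbered by their position in `include_columns`.

```python
def split_sparse_columns(arff_data, include_columns):
    """Keep only entries whose column is in include_columns, re-indexing columns.

    Illustrative stand-in, not the scikit-learn implementation.
    """
    values, row_indices, col_indices = arff_data
    # Map each original column index to its position in the filtered output.
    new_index = {col: pos for pos, col in enumerate(include_columns)}
    kept = ([], [], [])
    for val, row, col in zip(values, row_indices, col_indices):
        if col in new_index:
            kept[0].append(val)
            kept[1].append(row)
            kept[2].append(new_index[col])
    return kept


# Four stored entries in a 3 x 3 sparse layout: (values, rows, columns).
sparse = ([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 2], [0, 2, 1, 2])
filtered = split_sparse_columns(sparse, include_columns=[0, 2])
print(filtered)  # ([1.0, 2.0, 4.0], [0, 0, 2], [0, 1, 1])
```

The entry in original column 1 is dropped, and original column 2 becomes column 1 in the result. Row indices are left unchanged.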
Outputs
| Name | Type | Description |
|---|---|---|
| arff_data_new | tuple | Filtered sparse ARFF data with re-indexed columns (from `_split_sparse_columns`) |
| array | np.ndarray | Dense array converted from sparse ARFF data (from `_sparse_data_to_array`) |
| X | np.ndarray or DataFrame | Feature matrix loaded from the ARFF file (from `load_arff_from_gzip_file`) |
| y | np.ndarray or Series | Target variable loaded from the ARFF file (from `load_arff_from_gzip_file`) |
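The sparse-to-dense output can be sketched in a similar way. To stay dependency-free, this illustration returns nested Python lists instead of an `np.ndarray`. Three details are assumptions of the sketch, not confirmed behavior of the library: the name `sparse_to_dense`, the fill value `0.0` for entries that are not stored, and the row count being inferred from the largest row index.

```python
def sparse_to_dense(arff_data, include_columns, fill=0.0):
    """Expand a (values, row_indices, col_indices) triple into a dense table.

    Illustrative stand-in: returns a list of lists, not a NumPy array.
    """
    values, row_indices, col_indices = arff_data
    new_index = {col: pos for pos, col in enumerate(include_columns)}
    # Assumption: the number of rows is inferred from the largest row index,
    # so trailing rows with no stored entries would not appear.
    n_rows = max(row_indices) + 1
    dense = [[fill] * len(include_columns) for _ in range(n_rows)]
    for val, row, col in zip(values, row_indices, col_indices):
        if col in new_index:
            dense[row][new_index[col]] = val
    return dense


sparse = ([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 2], [0, 2, 1, 2])
dense = sparse_to_dense(sparse, include_columns=[0, 2])
print(dense)  # [[1.0, 2.0], [0.0, 0.0], [0.0, 4.0]]
```

Row 1 holds only fill values because its single stored entry sits in column 1, which is not selected.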
Usage Examples
Basic Usage
# Typically used internally by fetch_openml
from sklearn.datasets import fetch_openml
# This internally uses the ARFF parser
data = fetch_openml(name='iris', version=1, as_frame=True)
print(data.data.head())
print(data.target.head())