Implementation:Huggingface Datasets SparkDatasetReader

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for importing a PySpark DataFrame as a HuggingFace Dataset or IterableDataset.

Description

SparkDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Spark builder to convert a PySpark DataFrame into a HuggingFace Dataset or IterableDataset. The streaming parameter selects the return type: with streaming=True (the constructor default), read() returns an IterableDataset; with streaming=False, it materializes a cache and returns a map-style Dataset. Cache materialization is parallelized over Spark executors, so in non-streaming mode an NFS path accessible to the driver must be provided. The reader also accepts an explicit features schema, a cache directory, a working directory for intermediate files, a flag controlling cache reuse, and a cache file format (Arrow by default).
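As a rough mental model of how read() chooses between the two return types, the following dependency-free sketch may help. StubBuilder, SketchReader, the force_rebuild argument, and the string return values are hypothetical stand-ins invented for illustration; the real class delegates to the packaged Spark builder, and this sketch only mirrors the control flow described above rather than reproducing the library's code.

```python
from typing import List, Optional


class StubBuilder:
    """Hypothetical stand-in for the packaged Spark builder; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def as_streaming_dataset(self, split: Optional[str] = None) -> str:
        self.calls.append("as_streaming_dataset")
        return "IterableDataset"

    def download_and_prepare(self, force_rebuild: bool, file_format: str) -> None:
        # force_rebuild is an illustrative flag, not the real builder's argument.
        self.calls.append(
            f"download_and_prepare(force_rebuild={force_rebuild}, file_format={file_format!r})"
        )

    def as_dataset(self, split: Optional[str] = None) -> str:
        self.calls.append("as_dataset")
        return "Dataset"


class SketchReader:
    """Simplified model of the reader's streaming / non-streaming dispatch."""

    def __init__(
        self,
        builder: StubBuilder,
        streaming: bool = True,
        load_from_cache_file: bool = True,
        file_format: str = "arrow",
        split: Optional[str] = None,
    ) -> None:
        self.builder = builder
        self.streaming = streaming
        self.load_from_cache_file = load_from_cache_file
        self.file_format = file_format
        self.split = split

    def read(self) -> str:
        if self.streaming:
            # Streaming: nothing is written to the cache.
            return self.builder.as_streaming_dataset(split=self.split)
        # Non-streaming: materialize the cache first, then load from it.
        self.builder.download_and_prepare(
            force_rebuild=not self.load_from_cache_file,
            file_format=self.file_format,
        )
        return self.builder.as_dataset(split=self.split)


builder = StubBuilder()
result = SketchReader(builder, streaming=False, load_from_cache_file=False).read()
print(result, builder.calls)
```

The point of the sketch is that load_from_cache_file and file_format only matter on the non-streaming path; with the default streaming=True they are never consulted.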

Usage

Use SparkDatasetReader when you have a PySpark DataFrame and want to convert it to a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_spark(), but can also be instantiated directly.
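For the indirect route, a minimal sketch of a wrapper around Dataset.from_spark() could look like the following. The helper name to_map_style_dataset is hypothetical, and the sketch assumes Dataset.from_spark() accepts a cache_dir keyword, mirroring the reader's own parameter; the commented call at the end additionally assumes a working Spark installation.

```python
from typing import Any, Optional


def to_map_style_dataset(spark_df: Any, cache_dir: Optional[str] = None) -> Any:
    """Convert a PySpark DataFrame to a map-style Dataset via the public helper.

    to_map_style_dataset is an illustrative name, not part of the library.
    """
    # Imported lazily so this module can be loaded on machines without
    # the datasets package installed.
    from datasets import Dataset

    return Dataset.from_spark(spark_df, cache_dir=cache_dir)


# Example call (requires pyspark and a running Spark session):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "text"])
#   ds = to_map_style_dataset(df)
```

Instantiating SparkDatasetReader directly is mainly useful when you need a parameter the convenience method does not expose, such as file_format.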

Code Reference

Source Location

Repository: datasets
File: src/datasets/io/spark.py
Lines: L11-L57

Signature

class SparkDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        df: pyspark.sql.DataFrame,
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        streaming: bool = True,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        working_dir: str = None,
        load_from_cache_file: bool = True,
        file_format: str = "arrow",
        **kwargs,
    ):

    def read(self):

Import

from datasets.io.spark import SparkDatasetReader

I/O Contract

Inputs

Name	Type	Required	Description
df	`pyspark.sql.DataFrame`	Yes	The PySpark DataFrame to convert.
split	`Optional[NamedSplit]`	No	Name of the dataset split to assign.
features	`Optional[Features]`	No	Explicit schema to apply to the resulting dataset.
streaming	`bool`	No	If True (the default), returns an IterableDataset for streaming access.
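cache_dir	`str`	No	Directory for caching the processed dataset. On a multi-node Spark cluster it must be accessible to both the workers and the driver (for example, an NFS mount).
keep_in_memory	`bool`	No	Whether to keep the dataset in memory. Defaults to False.
working_dir	`str`	No	Intermediate directory that each Spark worker writes to before the data is moved to cache_dir.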
load_from_cache_file	`bool`	No	Whether to load from an existing cache if available. Defaults to True.
file_format	`str`	No	Format for cache files. Defaults to "arrow".
**kwargs		No	Additional keyword arguments forwarded to the Spark builder.

Outputs

Name	Type	Description
dataset	`Dataset` or `IterableDataset`	The loaded dataset: an IterableDataset when streaming is True, or a map-style Dataset when streaming is False.
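Because read() can return either type, downstream code that only needs to iterate can be written to accept both. The sketch below uses a hypothetical helper named head and plain Python stand-ins (a list and a generator) in place of real datasets; it assumes only that both dataset types yield one dict per row when iterated.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def head(dataset: Iterable[Dict[str, Any]], n: int = 5) -> List[Dict[str, Any]]:
    """Return the first n rows of a map-style or iterable dataset."""
    # islice stops after n rows, so a streaming source is not fully consumed.
    return list(islice(dataset, n))


# Stand-ins for the two return types: a list iterates like a map-style
# dataset, a generator like a streaming one.
map_style = [{"id": i} for i in range(10)]
streaming = ({"id": i} for i in range(10))

print(head(map_style, 2))  # [{'id': 0}, {'id': 1}]
print(head(streaming, 2))  # [{'id': 0}, {'id': 1}]
```

Operations that need random access or a known length, such as indexing by row number, are only available on the map-style result.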

Usage Examples

Basic Usage

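from pyspark.sql import SparkSession

from datasets.io.spark import SparkDatasetReader

# Build a small example DataFrame (any existing PySpark DataFrame works)
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "text"])

# Convert a PySpark DataFrame to a streaming IterableDataset
# (streaming=True is the constructor default)
reader = SparkDatasetReader(spark_df)
iterable_dataset = reader.read()

# Convert to a map-style Dataset with caching
# (the /mnt/nfs paths are placeholders for your own shared storage)
reader = SparkDatasetReader(
    spark_df,
    streaming=False,
    cache_dir="/mnt/nfs/cache",
    working_dir="/mnt/nfs/working",
)
dataset = reader.read()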

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Spark_Import
