Implementation:Huggingface Datasets SparkDatasetReader

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

SparkDatasetReader is the concrete tool in the HuggingFace Datasets library for importing a PySpark DataFrame as a HuggingFace Dataset or IterableDataset.

Description

SparkDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Spark builder to convert a PySpark DataFrame into a HuggingFace Dataset or IterableDataset. In streaming mode (streaming=True, the default), read() returns an IterableDataset. In non-streaming mode (streaming=False), the DataFrame is first materialized to a cache and then loaded as a map-style Dataset; cache materialization is parallelized over Spark executors, and an NFS path accessible to the driver must be provided. The reader also accepts an explicit features schema, a cache directory, a working directory for intermediate files, a flag controlling cache reuse, and a choice of cache file format (Arrow by default).
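The two modes described above can be modeled without a Spark cluster. The following is a minimal sketch, not the library's implementation: RecordingBuilder is a hypothetical stand-in for the packaged Spark builder that only records which methods are called, and the force flag is a simplification assumed here to represent cache reuse control.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordingBuilder:
    """Hypothetical stand-in for the Spark builder; it only records calls."""

    calls: List[str] = field(default_factory=list)

    def as_streaming_dataset(self, split: Optional[str] = None) -> str:
        self.calls.append("as_streaming_dataset")
        return "IterableDataset"

    def download_and_prepare(self, force: bool = False, file_format: str = "arrow") -> None:
        self.calls.append(f"download_and_prepare(force={force}, file_format={file_format})")

    def as_dataset(self, split: Optional[str] = None) -> str:
        self.calls.append("as_dataset")
        return "Dataset"


def read(builder, streaming=True, load_from_cache_file=True, file_format="arrow", split=None):
    """Simplified model of how a reader could dispatch between the two modes."""
    if streaming:
        # Streaming: no cache materialization step is modeled here.
        return builder.as_streaming_dataset(split=split)
    # Non-streaming: materialize the cache first, then load from it.
    # Assumption: disabling cache reuse forces the cache to be rebuilt.
    builder.download_and_prepare(force=not load_from_cache_file, file_format=file_format)
    return builder.as_dataset(split=split)
```

Calling read(RecordingBuilder()) takes the streaming branch, while read(RecordingBuilder(), streaming=False) records a prepare step followed by a load.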

Usage

Use SparkDatasetReader to convert a PySpark DataFrame into a HuggingFace Dataset or IterableDataset. It is typically invoked indirectly through Dataset.from_spark(), but it can also be instantiated directly.
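As an illustration of the two entry points, here is a short sketch. The helper name spark_to_dataset and its parameters are hypothetical; it assumes that datasets and pyspark are installed and that a map-style Dataset is wanted, which is why the direct route sets streaming=False explicitly (the reader's own default is streaming=True).

```python
def spark_to_dataset(spark_df, cache_dir):
    """Hypothetical helper: two routes from a PySpark DataFrame to a map-style Dataset."""
    # Imports are kept inside the function so that defining the helper
    # does not itself require datasets or pyspark to be importable.
    from datasets import Dataset
    from datasets.io.spark import SparkDatasetReader

    # Route 1: the usual, indirect entry point.
    via_classmethod = Dataset.from_spark(spark_df, cache_dir=cache_dir)

    # Route 2: instantiate the reader directly and request non-streaming mode.
    via_reader = SparkDatasetReader(
        spark_df,
        streaming=False,
        cache_dir=cache_dir,
    ).read()

    return via_classmethod, via_reader
```

Both routes are expected to yield equivalent datasets for the same DataFrame; the direct route is mainly useful when reader-level options need to be controlled explicitly.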

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/io/spark.py
  • Lines: L11-L57

Signature

class SparkDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        df: pyspark.sql.DataFrame,
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        streaming: bool = True,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        working_dir: str = None,
        load_from_cache_file: bool = True,
        file_format: str = "arrow",
        **kwargs,
    ):

    def read(self):

Import

from datasets.io.spark import SparkDatasetReader

I/O Contract

Inputs

Name | Type | Required | Description
df | pyspark.sql.DataFrame | Yes | The PySpark DataFrame to convert.
split | Optional[NamedSplit] | No | Name of the dataset split to assign. Defaults to None.
features | Optional[Features] | No | Explicit schema to apply to the resulting dataset. Defaults to None.
streaming | bool | No | If True (the default), read() returns an IterableDataset for streaming access; if False, it returns a map-style Dataset.
cache_dir | str | No | Directory for caching the processed dataset. Defaults to None.
keep_in_memory | bool | No | Whether to keep the dataset in memory. Defaults to False.
working_dir | str | No | NFS working directory accessible to all Spark executors for intermediate files. Defaults to None.
load_from_cache_file | bool | No | Whether to load from an existing cache if available. Defaults to True.
file_format | str | No | Format for cache files. Defaults to "arrow".
**kwargs | - | No | Additional keyword arguments forwarded to the Spark builder.

Outputs

Name | Type | Description
dataset | Dataset or IterableDataset | The loaded dataset, either map-style or iterable depending on the streaming parameter.

Usage Examples

Basic Usage

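from pyspark.sql import SparkSession

from datasets.io.spark import SparkDatasetReader

# Illustrative input only: any existing PySpark DataFrame works here
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, "hello"), (2, "world")], schema="id long, text string")

# Streaming mode is the default: read() returns an IterableDataset
reader = SparkDatasetReader(spark_df)
iterable_dataset = reader.read()

# Pass streaming=False to materialize a map-style Dataset in the cache
reader = SparkDatasetReader(
    spark_df,
    streaming=False,
    cache_dir="/mnt/nfs/cache",
    working_dir="/mnt/nfs/working",
)
dataset = reader.read()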

Related Pages

Implements Principle
