Implementation:Huggingface Datasets SparkDatasetReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for importing PySpark DataFrames into the HuggingFace Dataset format.
Description
SparkDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Spark builder to convert a PySpark DataFrame into a HuggingFace Dataset or IterableDataset. Streaming mode (the default) returns an IterableDataset. In non-streaming mode the reader returns a map-style Dataset: cache materialization is parallelized over Spark executors, and an NFS path accessible to the driver must be provided. The reader supports configurable features, a cache directory, a working directory for intermediate files, cache reuse control, and a choice of cache file format (Arrow by default).
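The two modes described above can be sketched as a small dispatch function. This is an illustrative sketch, not the library's source: `StubSparkBuilder` and `read_sketch` are hypothetical names, the stub only records which builder methods would be called, and the mapping from `load_from_cache_file` to a download mode is an assumption about how cache reuse is controlled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubSparkBuilder:
    """Hypothetical stand-in for the packaged Spark builder; it only records calls."""

    calls: List[str] = field(default_factory=list)

    def as_streaming_dataset(self, split: Optional[str] = None) -> str:
        self.calls.append("as_streaming_dataset")
        return "IterableDataset"

    def download_and_prepare(self, download_mode: str, file_format: str) -> None:
        self.calls.append(f"download_and_prepare({download_mode}, {file_format})")

    def as_dataset(self, split: Optional[str] = None) -> str:
        self.calls.append("as_dataset")
        return "Dataset"


def read_sketch(
    builder: StubSparkBuilder,
    streaming: bool = True,
    load_from_cache_file: bool = True,
    file_format: str = "arrow",
    split: Optional[str] = None,
) -> str:
    """Sketch of the read() decision: stream directly, or materialize a cache first."""
    if streaming:
        # Streaming path: nothing is written to the cache.
        return builder.as_streaming_dataset(split=split)
    # Assumption: cache reuse is expressed as a download mode passed to the builder.
    mode = "reuse_dataset_if_exists" if load_from_cache_file else "force_redownload"
    builder.download_and_prepare(download_mode=mode, file_format=file_format)
    return builder.as_dataset(split=split)
```

With the stub, `read_sketch(StubSparkBuilder())` takes the streaming path, while passing `streaming=False` first prepares the cache and then loads a map-style dataset from it.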
Usage
Use SparkDatasetReader when you have a PySpark DataFrame and want to convert it to a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_spark(), but can also be instantiated directly.
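The indirect route mentioned above might look like the following sketch. The helper names `to_map_style_dataset` and `to_iterable_dataset` are hypothetical wrappers introduced for illustration; the sketch assumes a `datasets` release that ships `Dataset.from_spark` and `IterableDataset.from_spark`, and that `spark_df` is an existing `pyspark.sql.DataFrame`.

```python
def to_map_style_dataset(spark_df, cache_dir=None, working_dir=None):
    """Convert a PySpark DataFrame to a map-style Dataset through the public API."""
    # Imported lazily so the helper can be defined even where `datasets` is not installed.
    from datasets import Dataset

    return Dataset.from_spark(spark_df, cache_dir=cache_dir, working_dir=working_dir)


def to_iterable_dataset(spark_df):
    """Convert a PySpark DataFrame to a streaming IterableDataset."""
    from datasets import IterableDataset

    return IterableDataset.from_spark(spark_df)
```

Going through `from_spark` keeps calling code independent of the reader class; instantiating SparkDatasetReader directly is mainly useful when you need to set reader-level options such as `file_format` yourself.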
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/io/spark.py
- Lines: L11-L57
Signature
class SparkDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        df: pyspark.sql.DataFrame,
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        streaming: bool = True,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        working_dir: str = None,
        load_from_cache_file: bool = True,
        file_format: str = "arrow",
        **kwargs,
    ): ...

    def read(self): ...
Import
from datasets.io.spark import SparkDatasetReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | pyspark.sql.DataFrame | Yes | The PySpark DataFrame to convert. |
| split | Optional[NamedSplit] | No | Name of the dataset split to assign. |
| features | Optional[Features] | No | Explicit schema to apply to the resulting dataset. |
| streaming | bool | No | If True (the default), returns an IterableDataset for streaming access; if False, returns a map-style Dataset. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory. Defaults to False. |
| working_dir | str | No | Intermediate directory that Spark workers write files to before they are moved to cache_dir. |
| load_from_cache_file | bool | No | Whether to load from an existing cache if available. Defaults to True. |
| file_format | str | No | Format for cache files. Defaults to "arrow". |
| **kwargs | Any | No | Additional keyword arguments forwarded to the Spark builder. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset or IterableDataset | The loaded dataset, either map-style or iterable depending on the streaming parameter. |
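Because the return type depends on the streaming parameter, calling code that only needs to inspect a few rows can avoid branching on the type. The sketch below relies only on the fact that both Dataset and IterableDataset yield rows as dictionaries when iterated; `peek` is a hypothetical helper name, not part of the library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return up to the first n rows of any row iterable without consuming the rest."""
    # islice stops after n items, so a streaming source is not read to the end.
    return list(islice(rows, n))
```

For example, `peek(reader.read(), 5)` would work for either mode, whereas `len()` and integer indexing are available only on the map-style Dataset.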
Usage Examples
Basic Usage
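from pyspark.sql import SparkSession

from datasets.io.spark import SparkDatasetReader

# Example DataFrame for illustration; replace with your own
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "text"])

# Convert a PySpark DataFrame to a streaming IterableDataset (streaming=True is the default)
reader = SparkDatasetReader(spark_df)
iterable_dataset = reader.read()

# Convert to a map-style Dataset with caching
reader = SparkDatasetReader(
    spark_df,
    streaming=False,
    cache_dir="/mnt/nfs/cache",
    working_dir="/mnt/nfs/working",
)
dataset = reader.read()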