Implementation:Huggingface Datasets Dataset Save To Disk
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for saving a Dataset to disk in native Arrow format.
Description
Dataset.save_to_disk writes the dataset to a directory as one or more Arrow shard files plus two JSON metadata files (dataset_info.json and state.json). The saved data can be reloaded with Dataset.load_from_disk, which memory-maps the Arrow files for near-instant, zero-copy access. Sharding is controlled either by max_shard_size (500MB when left unset) or by an explicit num_shards. The method also supports multiprocessing for parallel writes and remote filesystems via fsspec storage options. It preserves the dataset's formatting state, fingerprint, and split metadata.
Usage
Use Dataset.save_to_disk when you want to checkpoint a dataset's state for fast reloading in subsequent sessions, cache expensive preprocessing results, or save to a remote filesystem such as S3 or GCS.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L1509-L1598
Signature
def save_to_disk(
    self,
    dataset_path: PathLike,
    max_shard_size: Optional[Union[str, int]] = None,
    num_shards: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
):
Import
from datasets import Dataset
# save_to_disk is a method on Dataset instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_path | PathLike | Yes | Local path (e.g., "dataset/train") or remote URI (e.g., "s3://bucket/dataset/train") where the dataset will be saved. |
| max_shard_size | Optional[Union[str, int]] | No | Maximum size of each shard file. Can be a string like "500MB" or an integer in bytes. Mutually exclusive with num_shards. |
| num_shards | Optional[int] | No | Exact number of shards to create. Mutually exclusive with max_shard_size. |
| num_proc | Optional[int] | No | Number of processes for parallel writing. |
| storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend (e.g., S3 credentials). |
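To make the relationship between max_shard_size, num_shards, and num_proc more concrete, here is a rough, library-independent sketch of how a size limit could translate into a shard count. The helpers size_to_bytes and estimate_num_shards are hypothetical names written for this page, and the unit table (decimal MB/GB, binary MiB/GiB) is an assumption of the sketch; the library's own logic may differ in detail.

```python
# Assumed unit table for this sketch: decimal (KB/MB/GB) and binary (KiB/MiB/GiB).
_UNITS = {
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
}


def size_to_bytes(size):
    """Hypothetical helper: turn 500, "500MB", or "1GiB" into a byte count."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"Unrecognized size: {size!r}")


def estimate_num_shards(dataset_nbytes, max_shard_size="500MB", num_proc=None):
    """Hypothetical estimate: enough shards to respect the limit, at least one per process."""
    num_shards = dataset_nbytes // size_to_bytes(max_shard_size) + 1
    return max(num_shards, num_proc or 1)


print(estimate_num_shards(1_200_000_000))        # 1.2 GB of data with a 500MB limit
print(estimate_num_shards(10, num_proc=4))       # tiny data, but four writer processes
```

Passing num_shards explicitly bypasses any such estimate, which is why the two parameters are documented as mutually exclusive.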
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | The method writes files to disk and does not return a value. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("csv", data_files="input.csv", split="train")
# Save to a local directory
ds.save_to_disk("path/to/dataset/directory")
# Save with a maximum shard size
ds.save_to_disk("path/to/dataset/directory", max_shard_size="1GB")
# Save with explicit number of shards and multiprocessing
ds.save_to_disk("path/to/dataset/directory", num_shards=1024, num_proc=8)
# Save to a remote filesystem
ds.save_to_disk("s3://my-bucket/dataset/train", storage_options={"key": "...", "secret": "..."})
# Reload later
from datasets import load_from_disk
ds_reloaded = load_from_disk("path/to/dataset/directory")