Implementation:Huggingface Datasets Dataset Save To Disk

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Dataset.save_to_disk, provided by the HuggingFace Datasets library, saves a Dataset to disk in the library's native Arrow format.

Description

Dataset.save_to_disk writes the dataset to a directory as one or more Arrow shard files plus two JSON metadata files, dataset_info.json and state.json. The saved data can be reloaded with Dataset.load_from_disk (or the top-level load_from_disk function), which memory-maps the Arrow files instead of copying them into RAM, so reloading is fast even for large datasets. Sharding is controlled either by max_shard_size (default 500MB) or by an explicit num_shards; the two are mutually exclusive. The method also supports multiprocessing for parallel writes (num_proc) and remote filesystems through fsspec storage options (storage_options). It preserves the dataset's formatting state, fingerprint, and split metadata.

Usage

Use Dataset.save_to_disk when you want to checkpoint a dataset's state for fast reloading in subsequent sessions, cache expensive preprocessing results, or save to a remote filesystem such as S3 or GCS.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L1509-L1598

Signature

def save_to_disk(
    self,
    dataset_path: PathLike,
    max_shard_size: Optional[Union[str, int]] = None,
    num_shards: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
):

Import

from datasets import Dataset
# save_to_disk is a method on Dataset instances

I/O Contract

Inputs

Name	Type	Required	Description
dataset_path	`PathLike`	Yes	Local path (e.g., "dataset/train") or remote URI (e.g., "s3://bucket/dataset/train") where the dataset will be saved.
max_shard_size	`Optional[Union[str, int]]`	No	Maximum size of each shard file, given as a string like "500MB" or as an integer number of bytes. Defaults to 500MB when num_shards is not set. Mutually exclusive with num_shards.
num_shards	`Optional[int]`	No	Exact number of shards to create. Mutually exclusive with max_shard_size.
num_proc	`Optional[int]`	No	Number of processes for parallel writing. Multiprocessing is disabled by default.
storage_options	`Optional[dict]`	No	Key/value pairs for the fsspec file-system backend (e.g., S3 credentials).

Outputs

Name	Type	Description
(none)		The method writes files to disk and does not return a value.

Usage Examples

Basic Usage

from datasets import load_dataset, load_from_disk

# Load a CSV file as a Dataset ("input.csv" is a placeholder path)
ds = load_dataset("csv", data_files="input.csv", split="train")

# The save calls below are alternatives; use the one that fits your case.

# Save to a local directory
ds.save_to_disk("path/to/dataset/directory")

# Save with a maximum shard size
ds.save_to_disk("path/to/dataset/directory", max_shard_size="1GB")

# Save with explicit number of shards and multiprocessing
ds.save_to_disk("path/to/dataset/directory", num_shards=1024, num_proc=8)

# Save to a remote filesystem
ds.save_to_disk("s3://my-bucket/dataset/train", storage_options={"key": "...", "secret": "..."})

# Reload later; the Arrow files are memory-mapped on load
ds_reloaded = load_from_disk("path/to/dataset/directory")

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Disk_Persistence
