Implementation:Huggingface Datasets Dataset Save To Disk

From Leeroopedia
Revision as of 12:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_Dataset_Save_To_Disk.md)
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for saving a Dataset to disk in native Arrow format.

Description

Dataset.save_to_disk writes the dataset to a directory as one or more Arrow shard files plus two JSON metadata files (dataset_info.json and state.json). The saved data can be reloaded with Dataset.load_from_disk, which memory-maps the Arrow files for near-instant, zero-copy access. Shards are sized automatically from max_shard_size (default 500MB) unless an explicit num_shards is given; the two parameters cannot be combined. Writes can be parallelized across processes with num_proc, and remote filesystems are supported through fsspec via storage_options. The method preserves the dataset's formatting state, fingerprint, and split metadata.

Usage

Use Dataset.save_to_disk when you want to checkpoint a dataset's state for fast reloading in subsequent sessions, cache expensive preprocessing results, or save to a remote filesystem such as S3 or GCS.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L1509-L1598

Signature

def save_to_disk(
    self,
    dataset_path: PathLike,
    max_shard_size: Optional[Union[str, int]] = None,
    num_shards: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
):

Import

from datasets import Dataset
# save_to_disk is a method on Dataset instances

I/O Contract

Inputs

Name | Type | Required | Description
dataset_path | PathLike | Yes | Local path (e.g., "dataset/train") or remote URI (e.g., "s3://bucket/dataset/train") where the dataset will be saved.
max_shard_size | Optional[Union[str, int]] | No | Maximum size of each shard file. Can be a string like "500MB" or an integer in bytes. Mutually exclusive with num_shards.
num_shards | Optional[int] | No | Exact number of shards to create. Mutually exclusive with max_shard_size.
num_proc | Optional[int] | No | Number of processes for parallel writing.
storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend (e.g., S3 credentials).
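To make the relationship between max_shard_size and num_shards concrete, here is a standard-library-only sketch of how a shard count could be derived from the inputs above. The helpers parse_size and resolve_num_shards are hypothetical, the unit table (with "MB" as 10**6 bytes) is an assumption, and the rounding rule is an approximation that may not match the library's internal calculation exactly.

```python
import math
from typing import Optional, Union

# Decimal and binary unit suffixes; this table is an assumption of the sketch.
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}


def parse_size(size: Union[str, int]) -> int:
    """Convert a size such as "500MB" (or a plain byte count) to bytes."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"Unrecognized size: {size!r}")


def resolve_num_shards(
    dataset_nbytes: int,
    max_shard_size: Optional[Union[str, int]] = None,
    num_shards: Optional[int] = None,
    default_max_shard_size: str = "500MB",
) -> int:
    """Pick a shard count, enforcing that the two options are mutually exclusive."""
    if max_shard_size is not None and num_shards is not None:
        raise ValueError("Specify either max_shard_size or num_shards, not both.")
    if num_shards is not None:
        return num_shards
    limit = parse_size(max_shard_size if max_shard_size is not None else default_max_shard_size)
    # At least one shard, and enough shards that none exceeds the limit.
    return max(1, math.ceil(dataset_nbytes / limit))


print(resolve_num_shards(1_200_000_000))                        # default 500MB limit
print(resolve_num_shards(1_200_000_000, max_shard_size="1GB"))
```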

Outputs

Name | Type | Description
(none) | None | The method writes files to disk and does not return a value.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("csv", data_files="input.csv", split="train")

# Save to a local directory
ds.save_to_disk("path/to/dataset/directory")

# Save with a maximum shard size
ds.save_to_disk("path/to/dataset/directory", max_shard_size="1GB")

# Save with explicit number of shards and multiprocessing
ds.save_to_disk("path/to/dataset/directory", num_shards=1024, num_proc=8)

# Save to a remote filesystem
ds.save_to_disk("s3://my-bucket/dataset/train", storage_options={"key": "...", "secret": "..."})

# Reload later
from datasets import load_from_disk
ds_reloaded = load_from_disk("path/to/dataset/directory")

Related Pages

Implements Principle
