Principle:Huggingface Datasets Disk Persistence
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Disk Persistence is the principle of saving a Hugging Face Dataset to disk in its native Arrow format so that it can be reloaded later by memory-mapping the files, with no parsing step and almost no deserialization cost.
Description
Unlike export formats such as CSV or JSON, saving to disk in Arrow format preserves the dataset in full, including its schema, features, formatting state, fingerprint, and split metadata. The saved directory contains one or more Arrow shard files along with two JSON metadata files, dataset_info.json and state.json. When the directory is reloaded with Dataset.load_from_disk, the Arrow files are memory-mapped, so the data is accessible without parsing or copying it into process memory first. Dataset.save_to_disk supports sharding (controlled by max_shard_size or num_shards), parallel writes across multiple processes (num_proc), and remote filesystems through fsspec (storage_options).
Usage
Use Disk Persistence when you want to save intermediate or final dataset processing results for fast reloading in subsequent sessions. This is the recommended approach for checkpointing large datasets between preprocessing steps or for caching expensive transformations.
Theoretical Basis
Apache Arrow's IPC (Inter-Process Communication) format stores data as a sequence of record batches preceded by a self-describing schema. The on-disk layout of each batch matches the in-memory columnar layout, so writing a Dataset amounts to copying its Arrow buffers to a file with no per-value encoding, and reading it back is a memory-map operation. Opening the data therefore costs time proportional to the schema and record-batch metadata rather than to the volume of data, and the operating system pages values in only when they are accessed. This makes reloading far faster than re-parsing text-based formats, where every value must be tokenized and converted again. The sharding mechanism splits the dataset into multiple Arrow files to support parallel I/O and to keep individual file sizes manageable.