Heuristic:Huggingface Datasets Parquet Shard Sizing
Overview
Parquet file shard and row group sizing guidelines for optimal Hub hosting and query performance in HuggingFace Datasets.
Description
When HuggingFace Datasets writes Parquet files -- whether during push_to_hub, save_to_disk, or download_and_prepare -- two sizing parameters govern the physical layout of the output:
- Row group size (
MAX_ROW_GROUP_SIZE): Controls how many rows are packed into a single Parquet row group. A row group is the atomic read unit of Parquet; reading even a single row requires loading the entire row group into memory. The default target is 100 MB of uncompressed data per row group.
- Shard size (
MAX_SHARD_SIZE): Controls the maximum size of each individual Parquet file (shard). When a dataset exceeds this threshold, it is split into multiple shard files. The default target is 500 MB of uncompressed data per shard.
These two parameters work together: each shard file contains one or more row groups, and the library calculates the appropriate writer_batch_size to hit the row group target based on the ratio of rows to bytes in the dataset.
Usage
These defaults apply automatically whenever you:
- Call
Dataset.push_to_hub()orDatasetDict.push_to_hub()to upload a dataset to the HuggingFace Hub. - Call
DatasetBuilder.download_and_prepare()to materialize a dataset locally (when output format is Parquet). - Override via the
max_shard_sizeparameter onpush_to_hub(e.g.,max_shard_size="1GB"). - Modify global defaults by setting
datasets.config.MAX_SHARD_SIZEordatasets.config.MAX_ROW_GROUP_SIZEbefore writing.
The Insight (Rule of Thumb)
- Action: Use the default shard size of 500 MB and row group size of 100 MB unless you have specific reasons to change them.
- Value:
MAX_SHARD_SIZE = "500MB",MAX_ROW_GROUP_SIZE = "100MB"(uncompressed). - Trade-off: Larger shards mean fewer files but longer individual downloads and higher peak memory during streaming. Larger row groups mean fewer metadata lookups but more data read per random access of a single row.
Reasoning
Parquet row groups are the atomic read unit of the format. Accessing a single row requires reading its entire row group into memory. This creates a fundamental tension:
- Row groups too small: Excessive metadata overhead. Each row group carries its own column chunk metadata, page indexes, and statistics. Many tiny row groups inflate the Parquet footer and increase the number of I/O operations needed for columnar scans.
- Row groups too large: Wasteful reads for single-row access. If a row group is 1 GB and a consumer only needs one row, 1 GB must be read and decompressed. This is especially painful for the HuggingFace Hub dataset viewer, which serves preview rows to users on demand.
The 100 MB target is a pragmatic middle ground that:
- Keeps random access cost bounded -- reading one row never requires more than ~100 MB of I/O.
- Matches the expectations of the HuggingFace Hub dataset viewer, which relies on row-group-level access.
- Maintains reasonable metadata-to-data ratios for columnar query engines.
Similarly, the 500 MB shard size:
- Produces files large enough to amortize HTTP overhead when downloading from the Hub.
- Keeps individual files small enough that streaming consumers do not need to buffer excessive data in memory.
- Provides natural parallelism boundaries -- multiple shards can be downloaded and processed concurrently.
The number of shards is computed as int(dataset_nbytes / max_shard_size) + 1, ensuring at least one shard exists.
Code Evidence
config.py (lines 193-197)
Global default constants:
# Max uncompressed shard size in bytes (e.g. to shard parquet datasets in push_to_hub or download_and_prepare) MAX_SHARD_SIZE = "500MB" # Max uncompressed row group size in bytes (e.g. for parquet files in push_to_hub or download_and_prepare) MAX_ROW_GROUP_SIZE = "100MB"
Source: src/datasets/config.py, lines 193-197.
arrow_writer.py (lines 133-154)
Row group size calculation with explanatory comments:
def get_writer_batch_size_from_data_size(num_rows: int, num_bytes: int) -> int:
"""
Get the writer_batch_size that defines the maximum row group size in the parquet files.
The default in `datasets` is aiming for row groups of maximum 100MB uncompressed.
This allows to optimize random access to parquet file, since accessing 1 row requires
to read its entire row group.
This can be improved to get optimized size for querying/iterating
but at least it matches the dataset viewer expectations on HF.
Args:
num_rows (`int`):
Number of rows in the dataset.
num_bytes (`int`):
Number of bytes in the dataset.
For dataset with external files to embed (image, audio, videos), this can also be an
estimate from `dataset._estimate_nbytes()`.
Returns:
writer_batch_size (`Optional[int]`):
Writer batch size to pass to a parquet writer.
"""
return max(1, num_rows * convert_file_size_to_int(config.MAX_ROW_GROUP_SIZE) // num_bytes) if num_bytes > 0 else 1
Source: src/datasets/arrow_writer.py, lines 133-154.
arrow_dataset.py -- push_to_hub shard calculation (lines 5611-5614)
Shard count derivation from max_shard_size:
if num_shards is None:
max_shard_size = convert_file_size_to_int(max_shard_size or config.MAX_SHARD_SIZE)
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, num_proc or 1)
Source: src/datasets/arrow_dataset.py, lines 5611-5614.
arrow_dataset.py -- push_to_hub signature (lines 5675-5728)
The max_shard_size parameter on the public API:
max_shard_size (`int` or `str`, *optional*, defaults to `"500MB"`):
The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string,
needs to be digits followed by a unit (like `"5MB"`).
Source: src/datasets/arrow_dataset.py, lines 5726-5728.