Heuristic:Huggingface Datasets Parquet Shard Sizing

From Leeroopedia

Overview

Parquet file shard and row group sizing guidelines for optimal Hub hosting and query performance in HuggingFace Datasets.

Description

When HuggingFace Datasets writes Parquet files, for example during push_to_hub or download_and_prepare, two sizing parameters govern the physical layout of the output:

  • Row group size (MAX_ROW_GROUP_SIZE): Sets the target amount of data packed into a single Parquet row group, which in turn determines how many rows each group holds. A row group is the atomic read unit of Parquet; reading even a single row requires loading the entire row group into memory. The default target is 100 MB of uncompressed data per row group.
  • Shard size (MAX_SHARD_SIZE): Controls the maximum size of each individual Parquet file (shard). When a dataset exceeds this threshold, it is split into multiple shard files. The default target is 500 MB of uncompressed data per shard.

These two parameters work together: each shard file contains one or more row groups, and the library calculates the appropriate writer_batch_size to hit the row group target based on the ratio of rows to bytes in the dataset.

Usage

These defaults apply automatically whenever you:

  • Call Dataset.push_to_hub() or DatasetDict.push_to_hub() to upload a dataset to the HuggingFace Hub.
  • Call DatasetBuilder.download_and_prepare() to materialize a dataset locally (when output format is Parquet).

To change them:

  • Pass the max_shard_size parameter to push_to_hub (e.g., max_shard_size="1GB") to override the shard size for a single call.
  • Set datasets.config.MAX_SHARD_SIZE or datasets.config.MAX_ROW_GROUP_SIZE before writing to change the global defaults.
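
The two override routes listed above can be sketched as follows. This is a minimal illustration, not a recommended configuration: the helper names push_with_larger_shards and set_global_sizing_defaults are hypothetical, and the values "1GB" and "50MB" are arbitrary examples. The sketch assumes the installed version of the library reads the config constants at write time, as the code excerpts on this page suggest.

```python
def push_with_larger_shards(dataset, repo_id: str) -> None:
    """Per-call override: only this upload uses the larger shard size.

    `dataset` is expected to be a datasets.Dataset or datasets.DatasetDict.
    """
    dataset.push_to_hub(repo_id, max_shard_size="1GB")


def set_global_sizing_defaults() -> None:
    """Process-wide override: affects every later write in this process."""
    # Imported lazily so the sketch can be defined without the library installed.
    import datasets

    datasets.config.MAX_SHARD_SIZE = "1GB"       # example value, not a recommendation
    datasets.config.MAX_ROW_GROUP_SIZE = "50MB"  # example value, not a recommendation
```

The per-call form is usually the safer choice because it leaves the defaults untouched for other code running in the same process.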

The Insight (Rule of Thumb)

  • Action: Use the default shard size of 500 MB and row group size of 100 MB unless you have specific reasons to change them.
  • Value: MAX_SHARD_SIZE = "500MB", MAX_ROW_GROUP_SIZE = "100MB" (uncompressed).
  • Trade-off: Larger shards mean fewer files but longer individual downloads and higher peak memory during streaming. Larger row groups mean fewer metadata lookups but more data read per random access of a single row.

Reasoning

Parquet row groups are the atomic read unit of the format. Accessing a single row requires reading its entire row group into memory. This creates a fundamental tension:

  • Row groups too small: Excessive metadata overhead. Each row group carries its own column chunk metadata, page indexes, and statistics. Many tiny row groups inflate the Parquet footer and increase the number of I/O operations needed for columnar scans.
  • Row groups too large: Wasteful reads for single-row access. If a row group is 1 GB and a consumer only needs one row, 1 GB must be read and decompressed. This is especially painful for the HuggingFace Hub dataset viewer, which serves preview rows to users on demand.

The 100 MB target is a pragmatic middle ground that:

  1. Keeps random access cost bounded -- reading one row typically requires no more than about 100 MB of uncompressed data, unless a single row is itself larger than the target.
  2. Matches the expectations of the HuggingFace Hub dataset viewer, which relies on row-group-level access.
  3. Maintains reasonable metadata-to-data ratios for columnar query engines.

Similarly, the 500 MB shard size:

  1. Produces files large enough to amortize HTTP overhead when downloading from the Hub.
  2. Keeps individual files small enough that streaming consumers do not need to buffer excessive data in memory.
  3. Provides natural parallelism boundaries -- multiple shards can be downloaded and processed concurrently.

The number of shards is computed as int(dataset_nbytes / max_shard_size) + 1, which always yields at least one shard. In push_to_hub the result is then raised to num_proc if that value is larger.
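
As a worked illustration of the shard count formula, the sketch below repeats the arithmetic in plain Python. The function name estimate_num_shards is hypothetical, and the sketch assumes "500MB" is interpreted as 500 * 10**6 bytes (decimal units); check convert_file_size_to_int in your installed version if the exact byte count matters.

```python
# Hypothetical helper that mirrors the formula quoted above; not part of the library's API.
ASSUMED_MAX_SHARD_SIZE = 500 * 10**6  # assumption: "500MB" means decimal megabytes


def estimate_num_shards(dataset_nbytes: int, max_shard_size: int = ASSUMED_MAX_SHARD_SIZE) -> int:
    """Return int(dataset_nbytes / max_shard_size) + 1."""
    return int(dataset_nbytes / max_shard_size) + 1


print(estimate_num_shards(0))              # 1
print(estimate_num_shards(120 * 10**6))    # 1
print(estimate_num_shards(1_200 * 10**6))  # 3
print(estimate_num_shards(1_000 * 10**6))  # 3
```

Note the last case: a dataset that is an exact multiple of the shard size still gets one extra shard, so each shard ends up smaller than the maximum instead of exactly at it.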

Code Evidence

config.py (lines 193-197)

Global default constants:

# Max uncompressed shard size in bytes (e.g. to shard parquet datasets in push_to_hub or download_and_prepare)
MAX_SHARD_SIZE = "500MB"

# Max uncompressed row group size in bytes (e.g. for parquet files in push_to_hub or download_and_prepare)
MAX_ROW_GROUP_SIZE = "100MB"

Source: src/datasets/config.py, lines 193-197.

arrow_writer.py (lines 133-154)

Row group size calculation with explanatory comments:

def get_writer_batch_size_from_data_size(num_rows: int, num_bytes: int) -> int:
    """
    Get the writer_batch_size that defines the maximum row group size in the parquet files.
    The default in `datasets` is aiming for row groups of maximum 100MB uncompressed.
    This allows to optimize random access to parquet file, since accessing 1 row requires
    to read its entire row group.

    This can be improved to get optimized size for querying/iterating
    but at least it matches the dataset viewer expectations on HF.

    Args:
        num_rows (`int`):
            Number of rows in the dataset.
        num_bytes (`int`):
            Number of bytes in the dataset.
            For dataset with external files to embed (image, audio, videos), this can also be an
            estimate from `dataset._estimate_nbytes()`.
    Returns:
        writer_batch_size (`Optional[int]`):
            Writer batch size to pass to a parquet writer.
    """
    return max(1, num_rows * convert_file_size_to_int(config.MAX_ROW_GROUP_SIZE) // num_bytes) if num_bytes > 0 else 1

Source: src/datasets/arrow_writer.py, lines 133-154.
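
To make the return statement above concrete, here is a standalone sketch of the same arithmetic with two sample inputs. The function name writer_batch_size_from_data_size is hypothetical, and the constant assumes "100MB" converts to 100 * 10**6 bytes; the real function reads the value from config.MAX_ROW_GROUP_SIZE.

```python
# Standalone sketch of the arithmetic in the excerpt above; not the library function itself.
ASSUMED_MAX_ROW_GROUP_SIZE = 100 * 10**6  # assumption: "100MB" means decimal megabytes


def writer_batch_size_from_data_size(num_rows: int, num_bytes: int) -> int:
    """Rows per row group so that an average-sized group stays near the byte target."""
    if num_bytes <= 0:
        return 1
    return max(1, num_rows * ASSUMED_MAX_ROW_GROUP_SIZE // num_bytes)


# 1,000,000 rows in 2 GB is about 2 kB per row, so 50,000 rows fit in 100 MB.
print(writer_batch_size_from_data_size(1_000_000, 2 * 10**9))  # 50000

# 100 rows in 25 GB is about 250 MB per row: one row already exceeds the target,
# so the floor of 1 row per group applies.
print(writer_batch_size_from_data_size(100, 25 * 10**9))  # 1
```

Because the calculation uses the average row size, individual row groups can land above or below the 100 MB target when row sizes vary widely.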

arrow_dataset.py -- push_to_hub shard calculation (lines 5611-5614)

Shard count derivation from max_shard_size:

if num_shards is None:
    max_shard_size = convert_file_size_to_int(max_shard_size or config.MAX_SHARD_SIZE)
    num_shards = int(dataset_nbytes / max_shard_size) + 1
    num_shards = max(num_shards, num_proc or 1)

Source: src/datasets/arrow_dataset.py, lines 5611-5614.

arrow_dataset.py -- push_to_hub signature (lines 5675-5728)

The max_shard_size parameter on the public API:

max_shard_size (`int` or `str`, *optional*, defaults to `"500MB"`):
    The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string,
    needs to be digits followed by a unit (like `"5MB"`).

Source: src/datasets/arrow_dataset.py, lines 5726-5728.
