Implementation:Huggingface Datasets Sharding Utils

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Distributed_Computing, Data_Processing
Last Updated	2026-02-14 18:00 GMT

Overview

Utilities for distributing dataset generation work across multiple shards and workers.

Description

This module provides functions used internally by the dataset builder infrastructure to split dataset generation work across multiple parallel workers. The core concept is that gen_kwargs -- the keyword arguments passed to a builder's _generate_examples or _generate_tables method -- contain lists that represent data sources (e.g., file paths). These lists can be partitioned across workers for parallel processing.

The module provides five functions:

_number_of_shards_in_gen_kwargs: Determines how many shards are available by examining list-valued entries in gen_kwargs. All lists must have the same length; otherwise a RuntimeError is raised because it is ambiguous which list to parallelize over. Non-list values are treated as shared across all shards.

_distribute_shards: Divides num_shards indices into at most max_num_jobs contiguous ranges, distributing work as evenly as possible. Preserves shard ordering.

_split_gen_kwargs: Combines the two above to partition a gen_kwargs dict into a list of at most max_num_jobs per-job gen_kwargs dicts. Each job gets a contiguous subset of every list-valued entry, while non-list values are shared unchanged across all jobs.

_merge_gen_kwargs: The inverse of _split_gen_kwargs -- merges a list of per-worker gen_kwargs dicts back into a single dict by concatenating the list-valued entries in order. Non-list values are taken from the first dict in the list.

_shuffle_gen_kwargs: Returns a shuffled copy of gen_kwargs, using a NumPy random generator. Lists of the same size are shuffled with the same permutation, preserving alignment between entangled lists (e.g., file paths and their metadata).

Usage

These functions are used internally by the dataset builder infrastructure (e.g., in DatasetBuilder._prepare_split) and are not typically called directly by end users. They enable multiprocessing during dataset generation by splitting work across workers.
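To illustrate why per-worker gen_kwargs are useful, the sketch below feeds two pre-split dicts to a pool of workers. It is only an illustration: generate_examples and run_job are hypothetical stand-ins for a builder's generation method, and a standard-library thread pool is used here in place of whatever process pool the library itself relies on.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_examples(filepaths, split):
    # Hypothetical stand-in for a builder's _generate_examples method.
    for path in filepaths:
        yield f"{split}:{path}"


def run_job(job_kwargs):
    # Each job only sees its own slice of the data sources.
    return list(generate_examples(**job_kwargs))


# Per-worker gen_kwargs, written out by hand (the shape _split_gen_kwargs returns).
per_worker = [
    {"filepaths": ["shard_0.jsonl", "shard_1.jsonl"], "split": "train"},
    {"filepaths": ["shard_2.jsonl", "shard_3.jsonl"], "split": "train"},
]

with ThreadPoolExecutor(max_workers=len(per_worker)) as pool:
    results = list(pool.map(run_job, per_worker))  # map() keeps job order

examples = [example for job in results for example in job]
print(examples)
```

Because the jobs cover contiguous, ordered ranges of shards, concatenating their results in job order reproduces the order a single worker would have produced.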

Code Reference

Source Location

Repository: datasets
File: src/datasets/utils/sharding.py
Lines: 1-92

Signature

def _number_of_shards_in_gen_kwargs(gen_kwargs: dict) -> int:
    """Return the number of possible shards according to the input gen_kwargs."""

def _distribute_shards(num_shards: int, max_num_jobs: int) -> list[range]:
    """Get the range of shard indices per job."""

def _split_gen_kwargs(gen_kwargs: dict, max_num_jobs: int) -> list[dict]:
    """Split the gen_kwargs into at most `max_num_jobs` gen_kwargs."""

def _merge_gen_kwargs(gen_kwargs_list: list[dict]) -> dict:
    """Merge a list of per-worker gen_kwargs back into a single dict."""

def _shuffle_gen_kwargs(rng: np.random.Generator, gen_kwargs: dict) -> dict:
    """Return a shuffled copy of the input gen_kwargs."""

Import

from datasets.utils.sharding import _distribute_shards, _split_gen_kwargs

I/O Contract

`_number_of_shards_in_gen_kwargs`

Name	Type	Required	Description
gen_kwargs	`dict`	Yes	The generation keyword arguments. List-valued entries represent data sources to be sharded. All lists must have the same length.

Returns: int -- The number of shards: the common length of the list-valued entries, or 1 if there are no lists or the lists are empty.

Raises: RuntimeError -- If lists of different lengths are found, making sharding ambiguous.
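The counting rule in this contract can be sketched in a few lines. This is a minimal re-implementation for illustration, not the library's code: the name count_shards and the wording of the error message are assumptions. In this sketch only list instances count as data sources, so a tuple is treated as a shared value.

```python
def count_shards(gen_kwargs: dict) -> int:
    """Hypothetical re-implementation of the shard-counting rule."""
    # Only list values are considered data sources; everything else is shared.
    lengths = {key: len(value) for key, value in gen_kwargs.items() if isinstance(value, list)}
    if len(set(lengths.values())) > 1:
        details = ", ".join(f"{key}={length}" for key, length in lengths.items())
        raise RuntimeError(f"Sharding is ambiguous, list lengths differ: {details}")
    # With no lists (or only empty lists) there is still one shard.
    return max(1, max(lengths.values(), default=0))


print(count_shards({"filepaths": ["a.jsonl", "b.jsonl"], "split": "train"}))  # 2
print(count_shards({"split": "train"}))  # 1
```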

`_distribute_shards`

Name	Type	Required	Description
num_shards	`int`	Yes	Total number of shards to distribute.
max_num_jobs	`int`	Yes	Maximum number of workers/jobs.

Returns: list[range] -- A list of contiguous range objects, one per active job. Each range specifies the shard indices assigned to that job.

Example:

>>> _distribute_shards(2, max_num_jobs=4)
[range(0, 1), range(1, 2)]
>>> _distribute_shards(10, max_num_jobs=3)
[range(0, 4), range(4, 7), range(7, 10)]
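The arithmetic behind these outputs can be sketched as follows, assuming the even split works by integer division: every job gets num_shards // max_num_jobs shards, and the first num_shards % max_num_jobs jobs get one extra. The function name distribute is hypothetical; this is an illustrative sketch that reproduces the two examples above, not the library's implementation.

```python
def distribute(num_shards: int, max_num_jobs: int) -> list[range]:
    """Hypothetical sketch of an even, order-preserving shard distribution."""
    base, extra = divmod(num_shards, max_num_jobs)
    ranges = []
    start = 0
    for job_idx in range(max_num_jobs):
        # The first `extra` jobs absorb the remainder, one shard each.
        size = base + (1 if job_idx < extra else 0)
        if size == 0:
            break  # fewer shards than jobs: remaining jobs get nothing
        ranges.append(range(start, start + size))
        start += size
    return ranges


print(distribute(10, max_num_jobs=3))  # [range(0, 4), range(4, 7), range(7, 10)]
print(distribute(2, max_num_jobs=4))   # [range(0, 1), range(1, 2)]
```

For 10 shards and 3 jobs, divmod(10, 3) gives a base of 3 and a remainder of 1, so the job sizes are 4, 3 and 3.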

`_split_gen_kwargs`

Name	Type	Required	Description
gen_kwargs	`dict`	Yes	The generation keyword arguments to split.
max_num_jobs	`int`	Yes	Maximum number of parallel workers.

Returns: list[dict] -- A list of gen_kwargs dicts, one per active job (at most max_num_jobs, and a single dict when there is only one shard). List-valued entries are partitioned; non-list values are shared across all jobs.
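A self-contained sketch of this partitioning, under the same even-split assumption as _distribute_shards, might look like the following. The name split_kwargs is hypothetical and the error message is abbreviated; it is meant to show the shape of the result rather than the library's exact code.

```python
def split_kwargs(gen_kwargs: dict, max_num_jobs: int) -> list[dict]:
    """Hypothetical sketch: slice every list entry, share everything else."""
    lengths = {len(value) for value in gen_kwargs.values() if isinstance(value, list)}
    if len(lengths) > 1:
        raise RuntimeError("Sharding is ambiguous: list lengths differ")
    num_shards = max(lengths, default=0)
    if num_shards <= 1:
        return [dict(gen_kwargs)]  # nothing to split: one job gets everything
    base, extra = divmod(num_shards, max_num_jobs)
    jobs, start = [], 0
    for job_idx in range(max_num_jobs):
        size = base + (job_idx < extra)  # first `extra` jobs get one more shard
        if size == 0:
            break
        jobs.append({
            key: value[start:start + size] if isinstance(value, list) else value
            for key, value in gen_kwargs.items()
        })
        start += size
    return jobs


jobs = split_kwargs(
    {"filepaths": ["f0", "f1", "f2", "f3", "f4"], "split": "train"},
    max_num_jobs=2,
)
print([job["filepaths"] for job in jobs])  # [['f0', 'f1', 'f2'], ['f3', 'f4']]
```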

`_merge_gen_kwargs`

Name	Type	Required	Description
gen_kwargs_list	`list[dict]`	Yes	A list of per-worker `gen_kwargs` dicts to merge.

Returns: dict -- A single merged gen_kwargs dict with list entries concatenated in job order. Non-list values are taken from the first dict.
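The merge step can be sketched as a single dict comprehension. The name merge_kwargs is hypothetical, and the sketch assumes every dict in the list has the same keys, with non-list values read from the first one.

```python
def merge_kwargs(gen_kwargs_list: list[dict]) -> dict:
    """Hypothetical sketch: concatenate list entries, keep shared values once."""
    first = gen_kwargs_list[0]
    return {
        key: [item for job in gen_kwargs_list for item in job[key]]
        if isinstance(first[key], list)
        else first[key]
        for key in first
    }


per_worker = [
    {"filepaths": ["shard_0.jsonl", "shard_1.jsonl"], "split": "train"},
    {"filepaths": ["shard_2.jsonl", "shard_3.jsonl"], "split": "train"},
]
merged = merge_kwargs(per_worker)
print(merged["filepaths"])  # ['shard_0.jsonl', 'shard_1.jsonl', 'shard_2.jsonl', 'shard_3.jsonl']
```

Since jobs hold contiguous, ordered slices, merging the output of a split recovers the original list order.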

`_shuffle_gen_kwargs`

Name	Type	Required	Description
rng	`np.random.Generator`	Yes	A NumPy random number generator for reproducible shuffling.
gen_kwargs	`dict`	Yes	The generation keyword arguments to shuffle.

Returns: dict -- A new dict with list-valued entries shuffled. Lists of the same size receive the same permutation to maintain alignment.
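The alignment guarantee comes from drawing one permutation per list size and reusing it for every list of that size. The sketch below illustrates the idea with a hypothetical shuffle_kwargs function. To stay standard-library only it uses random.Random as a stand-in generator, which is an assumption of this example; the function only needs an object with a shuffle method that works in place on a list.

```python
import random


def shuffle_kwargs(rng, gen_kwargs: dict) -> dict:
    """Hypothetical sketch: one permutation per list size, reused across lists."""
    sizes = {len(value) for value in gen_kwargs.values() if isinstance(value, list)}
    indices_per_size = {}
    for size in sorted(sizes):  # sorted so the sketch iterates deterministically
        indices = list(range(size))
        rng.shuffle(indices)  # in-place shuffle of the index list
        indices_per_size[size] = indices
    # Build a new dict; the input dict and its lists are left untouched.
    return {
        key: [value[i] for i in indices_per_size[len(value)]]
        if isinstance(value, list)
        else value
        for key, value in gen_kwargs.items()
    }


rng = random.Random(0)  # stand-in for np.random.default_rng(...)
original = {
    "filepaths": ["a.jsonl", "b.jsonl", "c.jsonl"],
    "metadata": ["meta_a", "meta_b", "meta_c"],
    "split": "train",
}
shuffled = shuffle_kwargs(rng, original)
pairs = list(zip(shuffled["filepaths"], shuffled["metadata"]))
print(pairs)  # each file is still paired with its own metadata
```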

Usage Examples

Splitting Work Across Workers

from datasets.utils.sharding import _split_gen_kwargs

gen_kwargs = {
    "filepaths": ["shard_0.jsonl", "shard_1.jsonl", "shard_2.jsonl", "shard_3.jsonl"],
    "split": "train",  # non-list value shared across all workers
}

# Split across 2 workers
per_worker = _split_gen_kwargs(gen_kwargs, max_num_jobs=2)
print(per_worker[0])  # {'filepaths': ['shard_0.jsonl', 'shard_1.jsonl'], 'split': 'train'}
print(per_worker[1])  # {'filepaths': ['shard_2.jsonl', 'shard_3.jsonl'], 'split': 'train'}

Distributing Shards

from datasets.utils.sharding import _distribute_shards

# 10 shards across 3 workers
ranges = _distribute_shards(10, max_num_jobs=3)
print(ranges)  # [range(0, 4), range(4, 7), range(7, 10)]

# More workers than shards
ranges = _distribute_shards(2, max_num_jobs=4)
print(ranges)  # [range(0, 1), range(1, 2)]

Shuffling with Aligned Lists

import numpy as np
from datasets.utils.sharding import _shuffle_gen_kwargs

rng = np.random.default_rng(42)
gen_kwargs = {
    "filepaths": ["a.jsonl", "b.jsonl", "c.jsonl"],
    "metadata": ["meta_a", "meta_b", "meta_c"],  # same length, stays aligned
    "split": "train",
}

shuffled = _shuffle_gen_kwargs(rng, gen_kwargs)
# filepaths and metadata are shuffled with the same permutation
# split remains "train"

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Work_Sharding
