Implementation:Huggingface Datasets Sharding Utils
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Data_Processing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Utilities for distributing dataset generation work across multiple shards and workers.
Description
This module provides functions used internally by the dataset builder infrastructure to split dataset generation work across multiple parallel workers. The core concept is that gen_kwargs -- the keyword arguments passed to a builder's _generate_examples or _generate_tables method -- contain lists that represent data sources (e.g., file paths). These lists can be partitioned across workers for parallel processing.
The module provides five functions:
_number_of_shards_in_gen_kwargs: Determines how many shards are available by examining list-valued entries ingen_kwargs. All lists must have the same length (or an ambiguity error is raised), and non-list values are treated as shared across all shards.
_distribute_shards: Dividesnum_shardsindices into at mostmax_num_jobscontiguous ranges, distributing work as evenly as possible. Preserves shard ordering.
_split_gen_kwargs: Combines the two above to partition agen_kwargsdict into a list of per-workergen_kwargsdicts, where each worker gets a subset of the list-valued entries while non-list values are shared.
_merge_gen_kwargs: The inverse of_split_gen_kwargs-- merges a list of per-workergen_kwargsdicts back into a single dict by concatenating the list-valued entries.
_shuffle_gen_kwargs: Returns a shuffled copy ofgen_kwargs, using a NumPy random generator. Lists of the same size are shuffled with the same permutation, preserving alignment between entangled lists (e.g., file paths and their metadata).
Usage
These functions are used internally by the dataset builder infrastructure (e.g., in DatasetBuilder._prepare_split) and are not typically called directly by end users. They enable multiprocessing during dataset generation by splitting work across workers.
Code Reference
Source Location
- Repository: datasets
- File:
src/datasets/utils/sharding.py - Lines: 1-92
Signature
def _number_of_shards_in_gen_kwargs(gen_kwargs: dict) -> int:
"""Return the number of possible shards according to the input gen_kwargs."""
def _distribute_shards(num_shards: int, max_num_jobs: int) -> list[range]:
"""Get the range of shard indices per job."""
def _split_gen_kwargs(gen_kwargs: dict, max_num_jobs: int) -> list[dict]:
"""Split the gen_kwargs into `max_num_job` gen_kwargs."""
def _merge_gen_kwargs(gen_kwargs_list: list[dict]) -> dict:
"""Merge a list of per-worker gen_kwargs back into a single dict."""
def _shuffle_gen_kwargs(rng: np.random.Generator, gen_kwargs: dict) -> dict:
"""Return a shuffled copy of the input gen_kwargs."""
Import
from datasets.utils.sharding import _distribute_shards, _split_gen_kwargs
I/O Contract
_number_of_shards_in_gen_kwargs
| Name | Type | Required | Description |
|---|---|---|---|
| gen_kwargs | dict |
Yes | The generation keyword arguments. List-valued entries represent data sources to be sharded. All lists must have the same length. |
Returns: int -- The number of shards (max list length, minimum 1).
Raises: RuntimeError -- If lists of different lengths are found, making sharding ambiguous.
_distribute_shards
| Name | Type | Required | Description |
|---|---|---|---|
| num_shards | int |
Yes | Total number of shards to distribute. |
| max_num_jobs | int |
Yes | Maximum number of workers/jobs. |
Returns: list[range] -- A list of contiguous range objects, one per active job. Each range specifies the shard indices assigned to that job.
Example:
>>> _distribute_shards(2, max_num_jobs=4)
[range(0, 1), range(1, 2)]
>>> _distribute_shards(10, max_num_jobs=3)
[range(0, 4), range(4, 7), range(7, 10)]
_split_gen_kwargs
| Name | Type | Required | Description |
|---|---|---|---|
| gen_kwargs | dict |
Yes | The generation keyword arguments to split. |
| max_num_jobs | int |
Yes | Maximum number of parallel workers. |
Returns: list[dict] -- A list of gen_kwargs dicts, one per worker. List-valued entries are partitioned; non-list values are shared across all workers.
_merge_gen_kwargs
| Name | Type | Required | Description |
|---|---|---|---|
| gen_kwargs_list | list[dict] |
Yes | A list of per-worker gen_kwargs dicts to merge.
|
Returns: dict -- A single merged gen_kwargs dict with list entries concatenated.
_shuffle_gen_kwargs
| Name | Type | Required | Description |
|---|---|---|---|
| rng | np.random.Generator |
Yes | A NumPy random number generator for reproducible shuffling. |
| gen_kwargs | dict |
Yes | The generation keyword arguments to shuffle. |
Returns: dict -- A new dict with list-valued entries shuffled. Lists of the same size receive the same permutation to maintain alignment.
Usage Examples
Splitting Work Across Workers
from datasets.utils.sharding import _split_gen_kwargs
gen_kwargs = {
"filepaths": ["shard_0.jsonl", "shard_1.jsonl", "shard_2.jsonl", "shard_3.jsonl"],
"split": "train", # non-list value shared across all workers
}
# Split across 2 workers
per_worker = _split_gen_kwargs(gen_kwargs, max_num_jobs=2)
print(per_worker[0]) # {"filepaths": ["shard_0.jsonl", "shard_1.jsonl"], "split": "train"}
print(per_worker[1]) # {"filepaths": ["shard_2.jsonl", "shard_3.jsonl"], "split": "train"}
Distributing Shards
from datasets.utils.sharding import _distribute_shards
# 10 shards across 3 workers
ranges = _distribute_shards(10, max_num_jobs=3)
print(ranges) # [range(0, 4), range(4, 7), range(7, 10)]
# More workers than shards
ranges = _distribute_shards(2, max_num_jobs=4)
print(ranges) # [range(0, 1), range(1, 2)]
Shuffling with Aligned Lists
import numpy as np
from datasets.utils.sharding import _shuffle_gen_kwargs
rng = np.random.default_rng(42)
gen_kwargs = {
"filepaths": ["a.jsonl", "b.jsonl", "c.jsonl"],
"metadata": ["meta_a", "meta_b", "meta_c"], # same length, stays aligned
"split": "train",
}
shuffled = _shuffle_gen_kwargs(rng, gen_kwargs)
# filepaths and metadata are shuffled with the same permutation
# split remains "train"