Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Huggingface Datasets Sharding Utils

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, Data_Processing
Last Updated 2026-02-14 18:00 GMT

Overview

Utilities for distributing dataset generation work across multiple shards and workers.

Description

This module provides functions used internally by the dataset builder infrastructure to split dataset generation work across multiple parallel workers. The core concept is that gen_kwargs -- the keyword arguments passed to a builder's _generate_examples or _generate_tables method -- contain lists that represent data sources (e.g., file paths). These lists can be partitioned across workers for parallel processing.

The module provides five functions:

  • _number_of_shards_in_gen_kwargs: Determines how many shards are available by examining list-valued entries in gen_kwargs. All lists must have the same length (or an ambiguity error is raised), and non-list values are treated as shared across all shards.
  • _distribute_shards: Divides num_shards indices into at most max_num_jobs contiguous ranges, distributing work as evenly as possible. Preserves shard ordering.
  • _split_gen_kwargs: Combines the two above to partition a gen_kwargs dict into a list of per-worker gen_kwargs dicts, where each worker gets a subset of the list-valued entries while non-list values are shared.
  • _merge_gen_kwargs: The inverse of _split_gen_kwargs -- merges a list of per-worker gen_kwargs dicts back into a single dict by concatenating the list-valued entries.
  • _shuffle_gen_kwargs: Returns a shuffled copy of gen_kwargs, using a NumPy random generator. Lists of the same size are shuffled with the same permutation, preserving alignment between entangled lists (e.g., file paths and their metadata).

Usage

These functions are used internally by the dataset builder infrastructure (e.g., in DatasetBuilder._prepare_split) and are not typically called directly by end users. They enable multiprocessing during dataset generation by splitting work across workers.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/utils/sharding.py
  • Lines: 1-92

Signature

def _number_of_shards_in_gen_kwargs(gen_kwargs: dict) -> int:
    """Return the number of possible shards according to the input gen_kwargs."""

def _distribute_shards(num_shards: int, max_num_jobs: int) -> list[range]:
    """Get the range of shard indices per job."""

def _split_gen_kwargs(gen_kwargs: dict, max_num_jobs: int) -> list[dict]:
    """Split the gen_kwargs into `max_num_job` gen_kwargs."""

def _merge_gen_kwargs(gen_kwargs_list: list[dict]) -> dict:
    """Merge a list of per-worker gen_kwargs back into a single dict."""

def _shuffle_gen_kwargs(rng: np.random.Generator, gen_kwargs: dict) -> dict:
    """Return a shuffled copy of the input gen_kwargs."""

Import

from datasets.utils.sharding import _distribute_shards, _split_gen_kwargs

I/O Contract

_number_of_shards_in_gen_kwargs

Name Type Required Description
gen_kwargs dict Yes The generation keyword arguments. List-valued entries represent data sources to be sharded. All lists must have the same length.

Returns: int -- The number of shards (max list length, minimum 1).

Raises: RuntimeError -- If lists of different lengths are found, making sharding ambiguous.

_distribute_shards

Name Type Required Description
num_shards int Yes Total number of shards to distribute.
max_num_jobs int Yes Maximum number of workers/jobs.

Returns: list[range] -- A list of contiguous range objects, one per active job. Each range specifies the shard indices assigned to that job.

Example:

>>> _distribute_shards(2, max_num_jobs=4)
[range(0, 1), range(1, 2)]
>>> _distribute_shards(10, max_num_jobs=3)
[range(0, 4), range(4, 7), range(7, 10)]

_split_gen_kwargs

Name Type Required Description
gen_kwargs dict Yes The generation keyword arguments to split.
max_num_jobs int Yes Maximum number of parallel workers.

Returns: list[dict] -- A list of gen_kwargs dicts, one per worker. List-valued entries are partitioned; non-list values are shared across all workers.

_merge_gen_kwargs

Name Type Required Description
gen_kwargs_list list[dict] Yes A list of per-worker gen_kwargs dicts to merge.

Returns: dict -- A single merged gen_kwargs dict with list entries concatenated.

_shuffle_gen_kwargs

Name Type Required Description
rng np.random.Generator Yes A NumPy random number generator for reproducible shuffling.
gen_kwargs dict Yes The generation keyword arguments to shuffle.

Returns: dict -- A new dict with list-valued entries shuffled. Lists of the same size receive the same permutation to maintain alignment.

Usage Examples

Splitting Work Across Workers

from datasets.utils.sharding import _split_gen_kwargs

gen_kwargs = {
    "filepaths": ["shard_0.jsonl", "shard_1.jsonl", "shard_2.jsonl", "shard_3.jsonl"],
    "split": "train",  # non-list value shared across all workers
}

# Split across 2 workers
per_worker = _split_gen_kwargs(gen_kwargs, max_num_jobs=2)
print(per_worker[0])  # {"filepaths": ["shard_0.jsonl", "shard_1.jsonl"], "split": "train"}
print(per_worker[1])  # {"filepaths": ["shard_2.jsonl", "shard_3.jsonl"], "split": "train"}

Distributing Shards

from datasets.utils.sharding import _distribute_shards

# 10 shards across 3 workers
ranges = _distribute_shards(10, max_num_jobs=3)
print(ranges)  # [range(0, 4), range(4, 7), range(7, 10)]

# More workers than shards
ranges = _distribute_shards(2, max_num_jobs=4)
print(ranges)  # [range(0, 1), range(1, 2)]

Shuffling with Aligned Lists

import numpy as np
from datasets.utils.sharding import _shuffle_gen_kwargs

rng = np.random.default_rng(42)
gen_kwargs = {
    "filepaths": ["a.jsonl", "b.jsonl", "c.jsonl"],
    "metadata": ["meta_a", "meta_b", "meta_c"],  # same length, stays aligned
    "split": "train",
}

shuffled = _shuffle_gen_kwargs(rng, gen_kwargs)
# filepaths and metadata are shuffled with the same permutation
# split remains "train"

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment