Implementation:Huggingface Datasets Py Utils

Source	src/datasets/utils/py_utils.py
Domain(s)	Utilities, Data_Processing
Last Updated	2026-02-14

Overview

Description

The Py Utils module is a general-purpose utility collection that provides foundational helpers used throughout the datasets library. It contains functions for nested data structure manipulation, file size conversion, string parsing, context managers for temporary state, and multiprocessing utilities.

The module is organized into several functional groups:

Nested Data Processing:

map_nested -- Recursively applies a function to every leaf element of a nested data structure. Dicts and lists are traversed by default; tuples and numpy arrays are traversed only when map_tuple or map_numpy is set, and dict_only restricts traversal to dicts. Supports optional multiprocessing via num_proc and batched processing via batched and batch_size. Delegates to _single_map_nested for per-worker execution.
NestedDataStructure -- A wrapper class with a flatten method that recursively collapses any nested combination of dicts, lists, and tuples into a flat list of leaf values.
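
The traversal described above can be illustrated with a minimal standalone sketch. The names map_leaves and flatten_leaves are hypothetical and are not the library's code; the sketch covers only sequential traversal of dicts, lists, and tuples, and leaves out multiprocessing, batching, and numpy support.

```python
from typing import Any, Callable


def map_leaves(
    function: Callable[[Any], Any],
    data: Any,
    map_list: bool = True,
    map_tuple: bool = False,
) -> Any:
    """Hypothetical stand-in for the sequential path of a nested map."""
    if isinstance(data, dict):
        return {k: map_leaves(function, v, map_list, map_tuple) for k, v in data.items()}
    if map_list and isinstance(data, list):
        return [map_leaves(function, v, map_list, map_tuple) for v in data]
    if map_tuple and isinstance(data, tuple):
        return tuple(map_leaves(function, v, map_list, map_tuple) for v in data)
    # Anything not traversed is treated as a leaf and passed to the function whole.
    return function(data)


def flatten_leaves(data: Any) -> list:
    """Hypothetical stand-in for flattening: collapse dicts, lists, tuples into leaves."""
    if isinstance(data, dict):
        return flatten_leaves(list(data.values()))
    if isinstance(data, (list, tuple)):
        return [leaf for item in data for leaf in flatten_leaves(item)]
    return [data]
```

With map_tuple left at False in this sketch, a tuple is handed to the function as a single leaf rather than being traversed, which is why the traversal flags matter when the structure mixes container types.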

File Size Utilities:

size_str -- Converts a byte count to a human-readable string (e.g., "1.50 GiB"). Returns "Unknown size" for None or zero.
convert_file_size_to_int -- Parses a human-readable size string (e.g., "50MB", "1MiB") back to an integer byte count. Supports both binary (KiB, MiB, GiB, TiB, PiB) and decimal (KB, MB, GB, TB, PB) units.
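
The difference between binary and decimal units is easiest to see in code. The following is a standalone sketch, assuming the unit table and the "largest unit first" formatting rule shown here; parse_size and format_size are hypothetical names, not the library functions.

```python
import re

# Assumed multipliers for the unit suffixes listed above.
_UNITS = {
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
}


def parse_size(size):
    """Hypothetical parser: '50MB' -> 50000000, '1MiB' -> 1048576."""
    if isinstance(size, int):
        return size  # already a byte count
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number, unit = match.groups()
    multiplier = _UNITS.get(unit.upper())
    if multiplier is None:
        raise ValueError(f"Unknown unit: {unit!r}")
    return int(float(number) * multiplier)


def format_size(size_in_bytes):
    """Hypothetical formatter using binary units, largest unit first."""
    if not size_in_bytes:  # covers both None and 0
        return "Unknown size"
    for name, factor in (("PiB", 2**50), ("TiB", 2**40), ("GiB", 2**30),
                         ("MiB", 2**20), ("KiB", 2**10)):
        value = size_in_bytes / factor
        if value >= 1.0:
            return f"{value:.2f} {name}"
    return f"{int(size_in_bytes)} bytes"
```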

String and Dict Utilities:

string_to_dict -- Un-formats a string using a Python f-string-like pattern, extracting named groups into a dictionary. Returns None if the string does not match.
glob_pattern_to_regex -- Converts a filesystem glob pattern to a regular expression string.
NonMutableDict -- A dictionary subclass that raises ValueError when a key is overwritten, with a customizable error message.
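
As an illustration of two of these ideas, the sketch below un-formats a string by turning each {placeholder} into a named regex group, and shows a dict that refuses overwrites. Both unformat and WriteOnceDict are hypothetical names written for this page; the sketch requires a full match and supports only plain identifier placeholders, which may differ from the library's behavior.

```python
import re
from typing import Optional


def unformat(string: str, pattern: str) -> Optional[dict]:
    """Hypothetical sketch: extract placeholder values, or None if no match."""
    # re.split with one capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, string)
    return match.groupdict() if match else None


class WriteOnceDict(dict):
    """Hypothetical sketch of a dict whose keys cannot be overwritten."""

    def __setitem__(self, key, value):
        if key in self:
            raise ValueError(f"Try to overwrite existing key: {key}")
        super().__setitem__(key, value)
```

For glob patterns, the standard library's fnmatch.translate performs a comparable glob-to-regex conversion and can serve as a point of comparison.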

Context Managers:

temporary_assignment -- Temporarily sets an attribute on an object and restores the original value on exit.
temp_seed -- Temporarily sets the random seed for numpy, and optionally for PyTorch and TensorFlow, restoring all RNG states on exit.
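
Both context managers follow the same save, set, restore pattern. The sketch below shows it with contextlib; temporary_attr and temp_stdlib_seed are hypothetical names, and the seed example deliberately swaps in Python's built-in random module in place of numpy so that it runs without third-party dependencies.

```python
import random
from contextlib import contextmanager


@contextmanager
def temporary_attr(obj, attr, value):
    """Hypothetical sketch: set obj.attr for the duration of the block."""
    original = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        # Restore even if the block raises.
        setattr(obj, attr, original)


@contextmanager
def temp_stdlib_seed(seed: int):
    """Hypothetical sketch: seed the stdlib RNG, then restore its prior state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)
```

Restoring in a finally clause is the key design choice: the outer state comes back whether the block exits normally or through an exception.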

Multiprocessing Helpers:

iflatmap_unordered -- Applies a generator function across a multiprocessing pool, yielding results as they become available via a shared queue. Detects subprocess crashes and raises RuntimeError.
iter_batched -- Lazily partitions an iterable into fixed-size batches (lists), yielding a shorter final batch if the iterable length is not evenly divisible.
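
The lazy batching behavior can be sketched with itertools.islice. The function name batches_of is hypothetical, and this is one possible way to write such a helper rather than the library's code; it covers only the batching helper, not the pool-based flat-map.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batches_of(iterable: Iterable[T], n: int) -> Iterator[list[T]]:
    """Hypothetical sketch: yield lists of up to n items without materializing the input."""
    if n < 1:
        raise ValueError(f"Invalid batch size {n}")
    iterator = iter(iterable)
    # islice pulls at most n items per step; an empty list means the input is exhausted.
    while batch := list(islice(iterator, n)):
        yield batch
```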

Other Utilities:

classproperty -- A descriptor combining @classmethod and @property behavior.
has_sufficient_disk_space -- Checks whether a directory has enough free disk space for a given byte count.
asdict -- Recursively converts a dataclass to a dictionary, omitting fields at their default values unless marked otherwise.
unique_values -- Generator that yields only unique elements from an iterable, preserving order.
copyfunc -- Creates a shallow copy of a function object.
zip_dict -- Iterates over multiple dicts grouped by their keys, yielding (key, tuple_of_values).
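
Three of these helpers are small enough to sketch in a few lines each. The names ClassProperty, unique, and zip_by_key are hypothetical stand-ins written for illustration; in particular, ClassProperty is a plain descriptor and may not match how the library's classproperty is built or applied.

```python
from itertools import chain


class ClassProperty:
    """Hypothetical descriptor: a read-only property computed from the class."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        return self.fget(objtype)


def unique(values):
    """Yield each hashable value once, in first-seen order."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


def zip_by_key(*dicts):
    """Yield (key, tuple_of_values); raises KeyError if a dict lacks a key."""
    for key in unique(chain(*dicts)):
        yield key, tuple(d[key] for d in dicts)


class Builder:
    _name = "demo"

    @ClassProperty
    def name(cls):
        return cls._name.upper()
```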

Usage

These utilities are used pervasively throughout the datasets library. map_nested is central to applying transformations across dataset metadata structures. size_str and convert_file_size_to_int are used in download managers and dataset info display. temp_seed supports reproducible dataset generation. iflatmap_unordered powers parallel shard generation.
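
To make the "yield results as they become available" pattern behind iflatmap_unordered more concrete, here is a simplified sketch that swaps the multiprocessing pool for threads and a queue.Queue. The name flatmap_unordered_threads is hypothetical, and unlike the behavior described for the library function, this sketch does not detect or propagate worker failures.

```python
import queue
import threading


def flatmap_unordered_threads(func, kwargs_iterable):
    """Hypothetical thread-based sketch of an unordered flat-map over generators."""
    results = queue.Queue()
    sentinel = object()  # marks the end of one worker's output

    def worker(kwargs):
        try:
            for item in func(**kwargs):
                results.put(item)
        finally:
            results.put(sentinel)

    threads = [threading.Thread(target=worker, args=(kw,)) for kw in kwargs_iterable]
    for thread in threads:
        thread.start()

    remaining = len(threads)
    while remaining:
        item = results.get()
        if item is sentinel:
            remaining -= 1
        else:
            yield item  # order depends on scheduling, not on input order

    for thread in threads:
        thread.join()


def count_range(start, stop):
    """Example generator function used to exercise the sketch."""
    yield from range(start, stop)
```

Because results arrive in scheduling order, callers that need a stable order should sort or key the items themselves.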

Code Reference

Source Location

src/datasets/utils/py_utils.py (642 lines)

Signature

def map_nested(
    function: Callable[[Any], Any],
    data_struct: Any,
    dict_only: bool = False,
    map_list: bool = True,
    map_tuple: bool = False,
    map_numpy: bool = False,
    num_proc: Optional[int] = None,
    parallel_min_length: int = 2,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    types: Optional[tuple] = None,
    disable_tqdm: bool = True,
    desc: Optional[str] = None,
) -> Any: ...

def size_str(size_in_bytes) -> str: ...

def convert_file_size_to_int(size: Union[int, str]) -> int: ...

def string_to_dict(string: str, pattern: str) -> Optional[dict[str, str]]: ...

class NonMutableDict(dict): ...

class NestedDataStructure:
    def __init__(self, data=None): ...
    def flatten(self, data=None) -> list: ...

class classproperty(property): ...

def temporary_assignment(obj, attr, value): ...  # context manager
def temp_seed(seed: int, set_pytorch=False, set_tensorflow=False): ...  # context manager

def iflatmap_unordered(
    pool: Union[multiprocessing.pool.Pool, multiprocess.pool.Pool],
    func: Callable[..., Iterable[Y]],
    *,
    kwargs_iterable: Iterable[dict],
) -> Iterable[Y]: ...

def iter_batched(iterable: Iterable[T], n: int) -> Iterable[list[T]]: ...

Import

from datasets.utils.py_utils import map_nested, size_str, convert_file_size_to_int
from datasets.utils.py_utils import string_to_dict, temporary_assignment, temp_seed
from datasets.utils.py_utils import iflatmap_unordered, iter_batched
from datasets.utils.py_utils import NonMutableDict, NestedDataStructure, classproperty

I/O Contract

Inputs

Function	Input	Type	Description
`map_nested`	`function`	`Callable`	Function to apply to each leaf element
`map_nested`	`data_struct`	`Any`	Nested structure of dicts and lists; tuples and numpy arrays are traversed only when `map_tuple` or `map_numpy` is set
`map_nested`	`num_proc`	`Optional[int]`	Number of parallel processes; `None` (the default) or 1 runs sequentially
`size_str`	`size_in_bytes`	`int` or `None`	Byte count to format
`convert_file_size_to_int`	`size`	`Union[int, str]`	Integer byte count, or a size string like `"50MB"` or `"1MiB"`
`string_to_dict`	`string`	`str`	Input string to parse
`string_to_dict`	`pattern`	`str`	f-string-like pattern with named placeholders
`iter_batched`	`iterable`	`Iterable[T]`	Items to batch
`iter_batched`	`n`	`int`	Batch size (must be >= 1)

Outputs

Function	Output Type	Description
`map_nested`	`Any`	Same structure as input, with function applied to leaves
`size_str`	`str`	Human-readable size string (e.g., `"1.50 GiB"`)
`convert_file_size_to_int`	`int`	Size in bytes
`string_to_dict`	`Optional[dict]`	Extracted key-value pairs, or `None`
`iter_batched`	`Iterable[list[T]]`	Successive batches of up to `n` items

Usage Examples

Applying a function to a nested structure:

from datasets.utils.py_utils import map_nested

data = {"train": [1, 2, 3], "test": [4, 5]}
result = map_nested(lambda x: x * 2, data)
# result == {"train": [2, 4, 6], "test": [8, 10]}

Converting file sizes:

from datasets.utils.py_utils import size_str, convert_file_size_to_int

size_str(1610612736)
# "1.50 GiB"

convert_file_size_to_int("1MiB")
# 1048576

convert_file_size_to_int("50MB")
# 50000000

Un-formatting a string with a pattern:

from datasets.utils.py_utils import string_to_dict

pattern = "hello, my name is {name} and I am a {age} year old {what}"
s = "hello, my name is cody and I am a 18 year old quarterback"
string_to_dict(s, pattern)
# {"age": "18", "name": "cody", "what": "quarterback"}

Flattening a nested structure:

from datasets.utils.py_utils import NestedDataStructure

nested = {"a": [1, [2, 3]], "b": {"c": 4}}
flat = NestedDataStructure(nested).flatten()
# flat == [1, 2, 3, 4]

Batching an iterable:

from datasets.utils.py_utils import iter_batched

batches = list(iter_batched(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]

Temporarily setting a random seed:

from datasets.utils.py_utils import temp_seed
import numpy as np

with temp_seed(42):
    values = np.random.rand(3)
# numpy RNG state is restored after the block

Related Pages

Principle: Python Utilities -- The design principle governing how functions are recursively applied to nested data structures, including multiprocessing dispatch and type-aware traversal.
