Implementation:Huggingface Datasets Py Utils
| Source | src/datasets/utils/py_utils.py |
|---|---|
| Domain(s) | Utilities, Data_Processing |
| Last Updated | 2026-02-14 |
Overview
Description
The Py Utils module is a general-purpose utility collection that provides foundational helpers used throughout the datasets library. It contains functions for nested data structure manipulation, file size conversion, string parsing, context managers for temporary state, and multiprocessing utilities.
The module is organized into several functional groups:
Nested Data Processing:
- map_nested -- Recursively applies a function to every leaf element of a nested data structure. Dicts and lists are traversed by default; tuples and numpy arrays are traversed only when map_tuple or map_numpy is set. Supports optional multiprocessing via num_proc, batched processing via batched and batch_size, and configurable type traversal. Delegates to _single_map_nested for per-worker execution.
- NestedDataStructure -- A wrapper class whose flatten method recursively collapses any nested combination of dicts, lists, and tuples into a flat list of leaf values.
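The recursion behind these two helpers can be illustrated with a minimal single-process sketch. This is not the library's implementation: the names map_leaves and flatten_leaves are hypothetical, and multiprocessing, batching, and numpy handling are deliberately left out.

```python
from typing import Any, Callable


def map_leaves(function: Callable[[Any], Any], data: Any, map_tuple: bool = False) -> Any:
    """Apply `function` to every leaf, rebuilding the same container shape."""
    if isinstance(data, dict):
        return {key: map_leaves(function, value, map_tuple) for key, value in data.items()}
    if isinstance(data, list):
        return [map_leaves(function, item, map_tuple) for item in data]
    if map_tuple and isinstance(data, tuple):
        return tuple(map_leaves(function, item, map_tuple) for item in data)
    # Anything else (including a tuple when map_tuple is False) is treated as a leaf.
    return function(data)


def flatten_leaves(data: Any) -> list:
    """Collapse nested dicts, lists, and tuples into a flat list of leaves."""
    if isinstance(data, dict):
        return flatten_leaves(list(data.values()))
    if isinstance(data, (list, tuple)):
        return [leaf for item in data for leaf in flatten_leaves(item)]
    return [data]


nested = {"a": [1, (2, 3)], "b": {"c": 4}}
doubled = map_leaves(lambda x: x * 2, nested, map_tuple=True)
flat = flatten_leaves(nested)
```

In this sketch, leaving map_tuple at False passes the whole tuple (2, 3) to the function as a single leaf, which shows why the traversal flags matter.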
File Size Utilities:
- size_str -- Converts a byte count to a human-readable string (e.g., "1.50 GiB"). Returns "Unknown size" for None or zero.
- convert_file_size_to_int -- Parses a human-readable size string (e.g., "50MB", "1MiB") back to an integer byte count. Supports both binary (KiB, MiB, GiB, TiB, PiB) and decimal (KB, MB, GB, TB, PB) units.
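The difference between binary and decimal units is easiest to see in code. The following is a simplified sketch under stated assumptions, not the library's code: format_size and parse_size are hypothetical names, and only the unit suffixes listed above are handled.

```python
from typing import Optional

# Binary units, largest first, so the first unit that fits is chosen.
_BINARY_UNITS = [("PiB", 2**50), ("TiB", 2**40), ("GiB", 2**30), ("MiB", 2**20), ("KiB", 2**10)]
# Upper-cased suffix -> multiplier, covering binary and decimal units.
_SUFFIXES = {
    "PIB": 2**50, "TIB": 2**40, "GIB": 2**30, "MIB": 2**20, "KIB": 2**10,
    "PB": 10**15, "TB": 10**12, "GB": 10**9, "MB": 10**6, "KB": 10**3,
}


def format_size(size_in_bytes: Optional[int]) -> str:
    """Format a byte count using binary units with two decimals."""
    if not size_in_bytes:  # covers both None and 0
        return "Unknown size"
    for name, factor in _BINARY_UNITS:
        value = size_in_bytes / factor
        if value >= 1.0:
            return f"{value:.2f} {name}"
    return f"{int(size_in_bytes)} bytes"


def parse_size(size: str) -> int:
    """Parse strings such as "50MB" or "1MiB" into a byte count."""
    upper = size.strip().upper()
    for suffix, factor in _SUFFIXES.items():
        if upper.endswith(suffix):
            return int(upper[: -len(suffix)]) * factor
    raise ValueError(f"Unrecognized size string: {size!r}")
```

Note that "1MiB" and "1MB" differ by about 4.9%: the first is 2**20 bytes, the second 10**6 bytes.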
String and Dict Utilities:
- string_to_dict -- Un-formats a string using a Python f-string-like pattern, extracting named groups into a dictionary. Returns None if the string does not match.
- glob_pattern_to_regex -- Converts a filesystem glob pattern to a regular expression string.
- NonMutableDict -- A dictionary subclass that raises ValueError when a key is overwritten, with a customizable error message.
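Two of these ideas, un-formatting a string and refusing key overwrites, can be sketched in a few lines. The names unformat and WriteOnceDict are hypothetical, and the details are assumptions for illustration (for instance, this sketch escapes literal text and requires a full match), so it should not be read as the library's behavior.

```python
import re
from typing import Optional


def unformat(string: str, pattern: str) -> Optional[dict]:
    """Extract {name} placeholders from `string`, or return None on mismatch."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, string)
    return match.groupdict() if match else None


class WriteOnceDict(dict):
    """Dict that refuses to overwrite an existing key via item assignment."""

    def __setitem__(self, key, value):
        if key in self:
            raise ValueError(f"Key already set: {key!r}")
        super().__setitem__(key, value)


fields = unformat("train-00001-of-00008.arrow", "{split}-{shard}-of-{total}.arrow")
registry = WriteOnceDict()
registry["train"] = 1  # a second registry["train"] = ... would raise ValueError
```

The sketch guards only item assignment; a fuller version would also need to guard update().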
Context Managers:
- temporary_assignment -- Temporarily sets an attribute on an object and restores the original value on exit.
- temp_seed -- Temporarily sets the random seed for numpy, and optionally for PyTorch and TensorFlow, restoring all RNG states on exit.
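Both context managers follow the same save, set, restore pattern. A minimal sketch of that pattern is shown here; assign_temporarily and seeded are hypothetical names, and the standard-library random module stands in for numpy so the example has no third-party dependencies.

```python
import random
from contextlib import contextmanager


@contextmanager
def assign_temporarily(obj, attr, value):
    """Set obj.attr to `value` inside the block, then restore the old value."""
    original = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        # Runs even if the block raises, so the original value always comes back.
        setattr(obj, attr, original)


@contextmanager
def seeded(seed: int):
    """Seed the stdlib RNG inside the block, then restore its previous state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


class Config:
    verbose = False


config = Config()
with assign_temporarily(config, "verbose", True):
    inside = config.verbose
after = config.verbose

with seeded(42):
    first = random.random()
with seeded(42):
    second = random.random()
```

Putting the restore step in a finally clause is the key design choice: the caller's state is recovered whether the block exits normally or through an exception.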
Multiprocessing Helpers:
- iflatmap_unordered -- Applies a generator function across a multiprocessing pool, yielding results as they become available via a shared queue. Detects subprocess crashes and raises RuntimeError.
- iter_batched -- Lazily partitions an iterable into fixed-size batches (lists), yielding a shorter final batch if the iterable length is not evenly divisible.
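The two patterns in this group, lazy batching and an unordered flat-map fed through a shared queue, can be sketched as follows. This is only an illustration under assumptions: iter_batches and flatmap_unordered are hypothetical names, threads stand in for a process pool, and crash detection is omitted.

```python
import queue
import threading
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel a worker puts on the queue when its generator ends


def iter_batches(iterable: Iterable[T], n: int) -> Iterator[list]:
    """Lazily yield lists of up to `n` items; the last one may be shorter."""
    if n < 1:
        raise ValueError(f"Invalid batch size {n}")
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch


def flatmap_unordered(func: Callable[..., Iterable[T]], kwargs_iterable: Iterable[dict]) -> Iterator[T]:
    """Run func(**kwargs) in one thread per kwargs dict, yielding items as they arrive."""
    results: queue.Queue = queue.Queue()

    def worker(kwargs: dict) -> None:
        try:
            for item in func(**kwargs):
                results.put(item)
        finally:
            results.put(_DONE)  # always signal completion, even on error

    threads = [threading.Thread(target=worker, args=(kwargs,)) for kwargs in kwargs_iterable]
    for thread in threads:
        thread.start()
    remaining = len(threads)
    while remaining:
        item = results.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item
    for thread in threads:
        thread.join()


def count_up(start: int, stop: int):
    yield from range(start, stop)


batches = list(iter_batches(range(7), 3))
items = sorted(flatmap_unordered(count_up, [{"start": 0, "stop": 3}, {"start": 10, "stop": 12}]))
```

Because results are yielded in arrival order, the example sorts them before use; callers that need a stable order must impose it themselves.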
Other Utilities:
- classproperty -- A descriptor combining @classmethod and @property behavior.
- has_sufficient_disk_space -- Checks whether a directory has enough free disk space for a given byte count.
- asdict -- Recursively converts a dataclass to a dictionary, omitting fields at their default values unless marked otherwise.
- unique_values -- Generator that yields only unique elements from an iterable, preserving order.
- copyfunc -- Creates a shallow copy of a function object.
- zip_dict -- Iterates over multiple dicts grouped by their keys, yielding (key, tuple_of_values).
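A few of these helpers are small enough to sketch directly. The names class_property, unique_in_order, and zip_by_key are hypothetical, and the sketch makes its own simplifying choices: the descriptor is a plain class rather than a property subclass, and zip_by_key assumes every dict contains every key.

```python
from itertools import chain
from typing import Any, Iterable, Iterator


class class_property:
    """Read-only property evaluated on the class rather than an instance."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, objtype=None):
        # Works for both Class.attr (obj is None) and instance.attr access.
        return self.fget(objtype if objtype is not None else type(obj))


def unique_in_order(values: Iterable[Any]) -> Iterator[Any]:
    """Yield each hashable value once, in first-seen order."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


def zip_by_key(*dicts: dict) -> Iterator[tuple]:
    """Yield (key, (d1[key], d2[key], ...)) for each distinct key."""
    for key in unique_in_order(chain(*dicts)):
        yield key, tuple(d[key] for d in dicts)


class Builder:
    @class_property
    def name(cls) -> str:
        return cls.__name__.lower()


pairs = list(zip_by_key({"a": 1, "b": 2}, {"a": 10, "b": 20}))
```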
Usage
These utilities are used pervasively throughout the datasets library. map_nested is central to applying transformations across dataset metadata structures. size_str and convert_file_size_to_int are used in download managers and dataset info display. temp_seed supports reproducible dataset generation. iflatmap_unordered powers parallel shard generation.
Code Reference
Source Location
src/datasets/utils/py_utils.py (642 lines)
Signature
def map_nested(
    function: Callable[[Any], Any],
    data_struct: Any,
    dict_only: bool = False,
    map_list: bool = True,
    map_tuple: bool = False,
    map_numpy: bool = False,
    num_proc: Optional[int] = None,
    parallel_min_length: int = 2,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    types: Optional[tuple] = None,
    disable_tqdm: bool = True,
    desc: Optional[str] = None,
) -> Any: ...
def size_str(size_in_bytes) -> str: ...
def convert_file_size_to_int(size: Union[int, str]) -> int: ...
def string_to_dict(string: str, pattern: str) -> Optional[dict[str, str]]: ...
class NonMutableDict(dict): ...
class NestedDataStructure:
    def __init__(self, data=None): ...
    def flatten(self, data=None) -> list: ...
class classproperty(property): ...
def temporary_assignment(obj, attr, value): ...  # context manager
def temp_seed(seed: int, set_pytorch=False, set_tensorflow=False): ...  # context manager
def iflatmap_unordered(
    pool: Union[multiprocessing.pool.Pool, multiprocess.pool.Pool],
    func: Callable[..., Iterable[Y]],
    *,
    kwargs_iterable: Iterable[dict],
) -> Iterable[Y]: ...
def iter_batched(iterable: Iterable[T], n: int) -> Iterable[list[T]]: ...
Import
from datasets.utils.py_utils import map_nested, size_str, convert_file_size_to_int
from datasets.utils.py_utils import string_to_dict, temporary_assignment, temp_seed
from datasets.utils.py_utils import iflatmap_unordered, iter_batched
from datasets.utils.py_utils import NonMutableDict, NestedDataStructure, classproperty
I/O Contract
Inputs
| Function | Input | Type | Description |
|---|---|---|---|
| map_nested | function | Callable | Function to apply to each leaf element |
| map_nested | data_struct | Any | Nested structure of dicts, lists, tuples, or numpy arrays |
| map_nested | num_proc | Optional[int] | Number of parallel processes (1 for sequential) |
| size_str | size_in_bytes | int or None | Byte count to format |
| convert_file_size_to_int | size | Union[int, str] | Size string like "50MB" or "1MiB" |
| string_to_dict | string | str | Input string to parse |
| string_to_dict | pattern | str | f-string-like pattern with named placeholders |
| iter_batched | iterable | Iterable[T] | Items to batch |
| iter_batched | n | int | Batch size (must be >= 1) |
Outputs
| Function | Output Type | Description |
|---|---|---|
| map_nested | Any | Same structure as input, with function applied to leaves |
| size_str | str | Human-readable size string (e.g., "1.50 GiB") |
| convert_file_size_to_int | int | Size in bytes |
| string_to_dict | Optional[dict] | Extracted key-value pairs, or None |
| iter_batched | Iterable[list[T]] | Successive batches of up to n items |
Usage Examples
Applying a function to a nested structure:
from datasets.utils.py_utils import map_nested
data = {"train": [1, 2, 3], "test": [4, 5]}
result = map_nested(lambda x: x * 2, data)
# result == {"train": [2, 4, 6], "test": [8, 10]}
Converting file sizes:
from datasets.utils.py_utils import size_str, convert_file_size_to_int
size_str(1610612736)
# "1.50 GiB"
convert_file_size_to_int("1MiB")
# 1048576
convert_file_size_to_int("50MB")
# 50000000
Un-formatting a string with a pattern:
from datasets.utils.py_utils import string_to_dict
pattern = "hello, my name is {name} and I am a {age} year old {what}"
s = "hello, my name is cody and I am a 18 year old quarterback"
string_to_dict(s, pattern)
# {"age": "18", "name": "cody", "what": "quarterback"}
Flattening a nested structure:
from datasets.utils.py_utils import NestedDataStructure
nested = {"a": [1, [2, 3]], "b": {"c": 4}}
flat = NestedDataStructure(nested).flatten()
# flat == [1, 2, 3, 4]
Batching an iterable:
from datasets.utils.py_utils import iter_batched
batches = list(iter_batched(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
Temporarily setting a random seed:
from datasets.utils.py_utils import temp_seed
import numpy as np
with temp_seed(42):
    values = np.random.rand(3)
# numpy RNG state is restored after the block
Related Pages
- Principle: Python Utilities -- The design principle governing how functions are recursively applied to nested data structures, including multiprocessing dispatch and type-aware traversal.