Implementation:Huggingface Datasets Py Utils

From Leeroopedia
Source src/datasets/utils/py_utils.py
Domain(s) Utilities, Data_Processing
Last Updated 2026-02-14

Overview

Description

The Py Utils module is a general-purpose utility collection that provides foundational helpers used throughout the datasets library. It contains functions for nested data structure manipulation, file size conversion, string parsing, context managers for temporary state, and multiprocessing utilities.

The module is organized into several functional groups:

Nested Data Processing:

  • map_nested -- Recursively applies a function to every leaf element of a nested data structure. Dicts and lists are traversed by default; tuples and numpy arrays are traversed only when map_tuple or map_numpy is set. Supports optional multiprocessing via num_proc, batched processing via batched and batch_size, and configurable type traversal via types. Delegates to _single_map_nested for per-worker execution.
  • NestedDataStructure -- A wrapper class with a flatten method that recursively collapses any nested combination of dicts, lists, and tuples into a flat list of leaf values.
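The traversal these two helpers perform can be sketched in plain Python. This is a simplified illustration rather than the library's code: map_leaves and flatten_leaves are hypothetical names, and the sketch leaves out multiprocessing, batching, and numpy support.

```python
from typing import Any, Callable


def map_leaves(function: Callable[[Any], Any], data: Any, map_tuple: bool = False) -> Any:
    """Apply `function` to every leaf while preserving the container layout."""
    if isinstance(data, dict):
        return {key: map_leaves(function, value, map_tuple) for key, value in data.items()}
    if isinstance(data, list):
        return [map_leaves(function, item, map_tuple) for item in data]
    if map_tuple and isinstance(data, tuple):
        return tuple(map_leaves(function, item, map_tuple) for item in data)
    # Anything else (including a tuple when map_tuple is False) is treated as a leaf.
    return function(data)


def flatten_leaves(data: Any) -> list:
    """Collapse nested dicts, lists, and tuples into a flat list of leaves."""
    if isinstance(data, dict):
        data = list(data.values())
    if isinstance(data, (list, tuple)):
        return [leaf for item in data for leaf in flatten_leaves(item)]
    return [data]


doubled = map_leaves(lambda x: x * 2, {"train": [1, 2, 3], "test": [4, 5]})
# doubled == {"train": [2, 4, 6], "test": [8, 10]}

as_leaf = map_leaves(str, {"a": (1, 2)})                    # {"a": "(1, 2)"}
traversed = map_leaves(str, {"a": (1, 2)}, map_tuple=True)  # {"a": ("1", "2")}

flat = flatten_leaves({"a": [1, [2, 3]], "b": {"c": 4}})    # [1, 2, 3, 4]
```

The map_tuple flag in the sketch mirrors the idea behind the map_tuple parameter in the signature: a container type that is not selected for traversal is handed to the function whole.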

File Size Utilities:

  • size_str -- Converts a byte count to a human-readable string (e.g., "1.50 GiB"). Returns "Unknown size" for None or zero.
  • convert_file_size_to_int -- Parses a human-readable size string (e.g., "50MB", "1MiB") back to an integer byte count. Supports both binary (KiB, MiB, GiB, TiB, PiB) and decimal (KB, MB, GB, TB, PB) units.
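The difference between binary and decimal units is easiest to see in a small standalone sketch. The names parse_size and format_size below are hypothetical stand-ins, not the library functions, and the sketch handles only whole-number sizes with the unit suffixes listed above.

```python
import re

# Binary units are powers of 1024; decimal units are powers of 1000.
_BINARY = {"KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50}
_DECIMAL = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}
_UNITS = {**_BINARY, **_DECIMAL}


def parse_size(size) -> int:
    """Turn "50MB" or "1MiB" into a byte count; integers pass through unchanged."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", size.strip())
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"Unrecognized size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def format_size(num_bytes) -> str:
    """Format a byte count using the largest binary unit that fits."""
    if not num_bytes:  # covers both None and 0
        return "Unknown size"
    for name, factor in [("PiB", 2**50), ("TiB", 2**40), ("GiB", 2**30),
                         ("MiB", 2**20), ("KiB", 2**10)]:
        if num_bytes >= factor:
            return f"{num_bytes / factor:.2f} {name}"
    return f"{int(num_bytes)} bytes"


print(parse_size("1MiB"))       # 1048576
print(parse_size("50MB"))       # 50000000
print(format_size(1610612736))  # 1.50 GiB
```

Note that the two directions are not exact inverses: formatting rounds to two decimals, so parsing a formatted string would not in general recover the original byte count.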

String and Dict Utilities:

  • string_to_dict -- Un-formats a string using a Python f-string-like pattern, extracting named groups into a dictionary. Returns None if the string does not match.
  • glob_pattern_to_regex -- Converts a filesystem glob pattern to a regular expression string.
  • NonMutableDict -- A dictionary subclass that raises ValueError when a key is overwritten, with a customizable error message.
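A minimal sketch of two of these ideas, assuming placeholders are simple {name} fields without format specifiers and that each name appears once in the pattern. The names unformat and WriteOnceDict are hypothetical, and the library's own implementations may differ in detail.

```python
import re
from typing import Optional


def unformat(string: str, pattern: str) -> Optional[dict]:
    """Recover placeholder values from `string`, or return None on a mismatch."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, string)
    return match.groupdict() if match else None


class WriteOnceDict(dict):
    """A dict that refuses to overwrite an existing key via item assignment."""

    def __setitem__(self, key, value):
        if key in self:
            raise ValueError(f"Key {key!r} is already set and cannot be overwritten")
        super().__setitem__(key, value)


fields = unformat("data/train-00001.parquet", "data/{split}-{shard}.parquet")
# fields == {"split": "train", "shard": "00001"}

registry = WriteOnceDict()
registry["squad"] = "v1"
try:
    registry["squad"] = "v2"
    overwritten = True
except ValueError:
    overwritten = False  # the second assignment is rejected
```

Extracted values are always strings, which matches the example output later on this page where the age comes back as "18" rather than 18.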

Context Managers:

  • temporary_assignment -- Temporarily sets an attribute on an object and restores the original value on exit.
  • temp_seed -- Temporarily sets the random seed for numpy, and optionally for PyTorch and TensorFlow, restoring all RNG states on exit.
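Both context managers follow the same save, override, restore pattern. The sketch below illustrates that pattern with the standard library only; temporarily_set and temporary_random_seed are hypothetical names, and the seed example uses Python's random module as a stand-in for the numpy, PyTorch, and TensorFlow generators the library targets.

```python
import random
from contextlib import contextmanager


@contextmanager
def temporarily_set(obj, attr, value):
    """Set obj.attr to `value` inside the block, then restore the old value."""
    original = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        # Runs even if the block raises, so the override never leaks out.
        setattr(obj, attr, original)


@contextmanager
def temporary_random_seed(seed):
    """Seed the stdlib RNG inside the block, then restore its previous state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


class Settings:
    verbose = False


settings = Settings()
with temporarily_set(settings, "verbose", True):
    inside = settings.verbose   # True while the block runs
after = settings.verbose        # False again once the block exits

with temporary_random_seed(42):
    first_draw = random.random()
with temporary_random_seed(42):
    second_draw = random.random()
# first_draw == second_draw because both blocks start from the same seed
```

Putting the restore step in a finally clause is the key design choice: the original value comes back whether the block finishes normally or raises.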

Multiprocessing Helpers:

  • iflatmap_unordered -- Applies a generator function across a multiprocessing pool, yielding results as they become available via a shared queue. Detects subprocess crashes and raises RuntimeError.
  • iter_batched -- Lazily partitions an iterable into fixed-size batches (lists), yielding a shorter final batch if the iterable length is not evenly divisible.
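The two patterns can be sketched with the standard library. This is an illustration under stated assumptions: iter_in_batches and flatmap_unordered_threads are hypothetical names, and the second function uses threads and queue.Queue as a simplified analog of the process pool and shared queue described above, without any crash detection.

```python
import queue
import threading
from itertools import islice
from typing import Callable, Iterable, Iterator

_DONE = object()  # sentinel a worker puts on the queue when its generator is exhausted


def iter_in_batches(iterable: Iterable, n: int) -> Iterator[list]:
    """Lazily yield lists of up to `n` items; the last batch may be shorter."""
    if n < 1:
        raise ValueError(f"Batch size must be at least 1, got {n}")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, n))
        if not batch:
            return
        yield batch


def flatmap_unordered_threads(func: Callable[..., Iterable], kwargs_iterable: Iterable[dict]) -> Iterator:
    """Run `func(**kwargs)` per worker and yield items as soon as any worker produces them."""
    results: queue.Queue = queue.Queue()

    def worker(kwargs: dict) -> None:
        try:
            for item in func(**kwargs):
                results.put(item)
        finally:
            results.put(_DONE)  # always signal completion, even if func raised

    threads = [threading.Thread(target=worker, args=(kwargs,)) for kwargs in kwargs_iterable]
    for thread in threads:
        thread.start()
    remaining = len(threads)
    while remaining:
        item = results.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item
    for thread in threads:
        thread.join()


def count_up(start: int, stop: int) -> Iterator[int]:
    yield from range(start, stop)


batches = list(iter_in_batches((x * x for x in range(5)), 2))
# batches == [[0, 1], [4, 9], [16]]

merged = sorted(flatmap_unordered_threads(count_up, [{"start": 0, "stop": 3}, {"start": 10, "stop": 12}]))
# merged == [0, 1, 2, 10, 11]; the unsorted arrival order is not deterministic
```

Because results are yielded in arrival order, callers that need a stable order have to sort or tag the items themselves, as the sorted call does here.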

Other Utilities:

  • classproperty -- A descriptor combining @classmethod and @property behavior.
  • has_sufficient_disk_space -- Checks whether a directory has enough free disk space for a given byte count.
  • asdict -- Recursively converts a dataclass to a dictionary, omitting fields at their default values unless marked otherwise.
  • unique_values -- Generator that yields only unique elements from an iterable, preserving order.
  • copyfunc -- Creates a shallow copy of a function object.
  • zip_dict -- Iterates over multiple dicts grouped by their keys, yielding (key, tuple_of_values).
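Several of these helpers are short enough to sketch directly. The versions below are simplified illustrations with hypothetical names (class_property, unique_in_order, zip_by_key, has_room); they show the idea behind each utility and are not copies of the library code.

```python
import itertools
import shutil


class class_property:
    """Descriptor that calls the wrapped function with the class, like a read-only class attribute."""

    def __init__(self, getter):
        self.getter = getter

    def __get__(self, instance, owner=None):
        if owner is None:
            owner = type(instance)
        return self.getter(owner)


def unique_in_order(values):
    """Yield each distinct element once, in first-seen order."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


def zip_by_key(*dicts):
    """Yield (key, tuple_of_values) for every key; raises KeyError if a dict lacks a key."""
    for key in unique_in_order(itertools.chain(*dicts)):
        yield key, tuple(d[key] for d in dicts)


def has_room(needed_bytes: int, directory: str = ".") -> bool:
    """Return True if `directory` has at least `needed_bytes` free."""
    return needed_bytes <= shutil.disk_usage(directory).free


class Builder:
    name = "demo"

    @class_property
    def label(cls):
        return f"builder:{cls.name}"


label = Builder.label                               # "builder:demo", no instance needed
uniques = list(unique_in_order([3, 1, 3, 2, 1]))    # [3, 1, 2]
pairs = list(zip_by_key({"a": 1, "b": 2}, {"a": 10, "b": 20}))
# pairs == [("a", (1, 10)), ("b", (2, 20))]
```

The order-preserving generator relies on elements being hashable, since it tracks what it has seen in a set.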

Usage

These utilities are used throughout the datasets library. map_nested applies transformations across nested dataset metadata structures. size_str and convert_file_size_to_int are used in download managers and dataset info display. temp_seed supports reproducible dataset generation. iflatmap_unordered powers parallel shard generation.

Code Reference

Source Location

src/datasets/utils/py_utils.py (642 lines)

Signature

def map_nested(
    function: Callable[[Any], Any],
    data_struct: Any,
    dict_only: bool = False,
    map_list: bool = True,
    map_tuple: bool = False,
    map_numpy: bool = False,
    num_proc: Optional[int] = None,
    parallel_min_length: int = 2,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    types: Optional[tuple] = None,
    disable_tqdm: bool = True,
    desc: Optional[str] = None,
) -> Any: ...

def size_str(size_in_bytes) -> str: ...

def convert_file_size_to_int(size: Union[int, str]) -> int: ...

def string_to_dict(string: str, pattern: str) -> Optional[dict[str, str]]: ...

class NonMutableDict(dict): ...

class NestedDataStructure:
    def __init__(self, data=None): ...
    def flatten(self, data=None) -> list: ...

class classproperty(property): ...

def temporary_assignment(obj, attr, value): ...  # context manager
def temp_seed(seed: int, set_pytorch=False, set_tensorflow=False): ...  # context manager

def iflatmap_unordered(
    pool: Union[multiprocessing.pool.Pool, multiprocess.pool.Pool],
    func: Callable[..., Iterable[Y]],
    *,
    kwargs_iterable: Iterable[dict],
) -> Iterable[Y]: ...

def iter_batched(iterable: Iterable[T], n: int) -> Iterable[list[T]]: ...

Import

from datasets.utils.py_utils import map_nested, size_str, convert_file_size_to_int
from datasets.utils.py_utils import string_to_dict, temporary_assignment, temp_seed
from datasets.utils.py_utils import iflatmap_unordered, iter_batched
from datasets.utils.py_utils import NonMutableDict, NestedDataStructure, classproperty

I/O Contract

Inputs

| Function | Input | Type | Description |
| --- | --- | --- | --- |
| map_nested | function | Callable | Function to apply to each leaf element |
| map_nested | data_struct | Any | Nested structure of dicts and lists (tuples and numpy arrays are traversed only when map_tuple or map_numpy is set) |
| map_nested | num_proc | Optional[int] | Number of parallel processes (None or 1 runs sequentially) |
| size_str | size_in_bytes | int or None | Byte count to format |
| convert_file_size_to_int | size | Union[int, str] | Integer byte count, or a size string like "50MB" or "1MiB" |
| string_to_dict | string | str | Input string to parse |
| string_to_dict | pattern | str | f-string-like pattern with named placeholders |
| iter_batched | iterable | Iterable[T] | Items to batch |
| iter_batched | n | int | Batch size (must be >= 1) |

Outputs

| Function | Output Type | Description |
| --- | --- | --- |
| map_nested | Any | Same structure as input, with function applied to leaves |
| size_str | str | Human-readable size string (e.g., "1.50 GiB") |
| convert_file_size_to_int | int | Size in bytes |
| string_to_dict | Optional[dict] | Extracted key-value pairs, or None |
| iter_batched | Iterable[list[T]] | Successive batches of up to n items |

Usage Examples

Applying a function to a nested structure:

from datasets.utils.py_utils import map_nested

data = {"train": [1, 2, 3], "test": [4, 5]}
result = map_nested(lambda x: x * 2, data)
# result == {"train": [2, 4, 6], "test": [8, 10]}

Converting file sizes:

from datasets.utils.py_utils import size_str, convert_file_size_to_int

size_str(1610612736)
# "1.50 GiB"

convert_file_size_to_int("1MiB")
# 1048576

convert_file_size_to_int("50MB")
# 50000000

Un-formatting a string with a pattern:

from datasets.utils.py_utils import string_to_dict

pattern = "hello, my name is {name} and I am a {age} year old {what}"
s = "hello, my name is cody and I am a 18 year old quarterback"
string_to_dict(s, pattern)
# {"age": "18", "name": "cody", "what": "quarterback"}

Flattening a nested structure:

from datasets.utils.py_utils import NestedDataStructure

nested = {"a": [1, [2, 3]], "b": {"c": 4}}
flat = NestedDataStructure(nested).flatten()
# flat == [1, 2, 3, 4]

Batching an iterable:

from datasets.utils.py_utils import iter_batched

batches = list(iter_batched(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]

Temporarily setting a random seed:

from datasets.utils.py_utils import temp_seed
import numpy as np

with temp_seed(42):
    values = np.random.rand(3)
# numpy RNG state is restored after the block

Related Pages

  • Principle: Python Utilities -- The design principle governing how functions are recursively applied to nested data structures, including multiprocessing dispatch and type-aware traversal.
