Implementation:Huggingface Datasets Dataset From Generator

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the Hugging Face Datasets library, for creating a Dataset from a Python generator function.

Description

Dataset.from_generator is a static method that takes a generator function (a callable that yields one dictionary per example) and produces a cached, Arrow-backed Dataset. The generator is executed through the internal GeneratorDatasetInputStream pipeline, which handles serialization, caching, and optional multiprocessing. When num_proc is greater than 1, list-valued entries in gen_kwargs are automatically split across the worker processes, so each worker runs the generator on its own portion of those lists.
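The splitting of list-valued gen_kwargs can be pictured with a small pure-Python sketch. The helper name split_gen_kwargs_sketch and the contiguous-chunk strategy are assumptions made for illustration; this is not the library's internal code, and the real partitioning may differ in detail.

```python
from typing import Any


def split_gen_kwargs_sketch(gen_kwargs: dict[str, Any], num_proc: int) -> list[dict[str, Any]]:
    """Hypothetical helper: give each worker a contiguous slice of every list-valued kwarg.

    Non-list values are passed to every worker unchanged.
    """
    list_lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if len(list_lengths) != 1:
        raise ValueError("expected at least one list-valued kwarg, all of the same length")
    num_shards = list_lengths.pop()

    # Never create more jobs than there are shards (but always at least one).
    num_jobs = max(1, min(num_proc, num_shards))
    base, extra = divmod(num_shards, num_jobs)

    jobs = []
    start = 0
    for job_id in range(num_jobs):
        # The first `extra` jobs take one additional shard each.
        stop = start + base + (1 if job_id < extra else 0)
        jobs.append(
            {k: (v[start:stop] if isinstance(v, list) else v) for k, v in gen_kwargs.items()}
        )
        start = stop
    return jobs
```

Under these assumptions, 32 shards with num_proc=4 yield four jobs of eight shards each, and each job's kwargs would be passed to a separate invocation of the generator.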

Usage

Use Dataset.from_generator when your data is produced lazily, such as when reading from multiple files, streaming from an API, or performing on-the-fly transformations. It suits large custom datasets because examples are consumed one at a time from the generator, so the raw data does not need to be loaded into memory all at once.
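As an illustration of the multiple-files case, the following is a minimal sketch that lazily reads JSON Lines files. The function names read_jsonl and build_dataset and the JSON Lines input format are assumptions for the example, not part of the library.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(paths: list[str]) -> Iterator[dict]:
    """Yield one example dict per non-empty line, one file at a time."""
    for path in paths:
        with Path(path).open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


def build_dataset(paths: list[str]):
    """Hypothetical wrapper: materialize the generator output as a Dataset."""
    # Imported here so the generator itself has no dependency on the library.
    from datasets import Dataset

    return Dataset.from_generator(read_jsonl, gen_kwargs={"paths": paths})
```

Passing the file list through gen_kwargs, rather than capturing it in a closure, keeps the generator compatible with the num_proc sharding described above.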

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: 1123-1203

Signature

@staticmethod
def from_generator(
    generator: Callable,
    features: Optional[Features] = None,
    cache_dir: str = None,
    keep_in_memory: bool = False,
    gen_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    split: NamedSplit = Split.TRAIN,
    fingerprint: Optional[str] = None,
    **kwargs,
):

Import

from datasets import Dataset

I/O Contract

Inputs

Name            Type        Required  Description
generator       Callable    Yes       A generator function that yields example dictionaries.
features        Features    No        Explicit dataset features schema.
cache_dir       str         No        Directory to cache generated data. Defaults to ~/.cache/huggingface/datasets.
keep_in_memory  bool        No        Whether to keep the dataset in memory. Defaults to False.
gen_kwargs      dict        No        Keyword arguments passed to the generator function. List-valued entries are split across workers when num_proc is greater than 1.
num_proc        int         No        Number of processes for parallel generation. Disabled by default.
split           NamedSplit  No        Split name assigned to the dataset. Defaults to Split.TRAIN.
fingerprint     str         No        Custom fingerprint for cache identification. By default, derived by hashing the generator and its arguments.
**kwargs        -           No        Additional keyword arguments passed to GeneratorConfig.
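The optional inputs can be combined as in the sketch below, which supplies an explicit features schema and keeps the result in memory. The label set, the gen and build_typed_dataset names, and the choice of ClassLabel for the label column are illustrative assumptions.

```python
from typing import Iterator

LABEL_NAMES = ["positive", "negative"]  # illustrative label set


def gen(rows: list[tuple[str, str]]) -> Iterator[dict]:
    """Yield examples whose label is the integer index into LABEL_NAMES."""
    for text, label in rows:
        yield {"text": text, "label": LABEL_NAMES.index(label)}


def build_typed_dataset(rows: list[tuple[str, str]]):
    """Hypothetical wrapper showing `features` and `keep_in_memory`."""
    # Imported here so the generator itself has no dependency on the library.
    from datasets import ClassLabel, Dataset, Features, Value

    features = Features(
        {
            "text": Value("string"),
            "label": ClassLabel(names=LABEL_NAMES),
        }
    )
    return Dataset.from_generator(
        gen,
        features=features,       # explicit schema instead of type inference
        gen_kwargs={"rows": rows},
        keep_in_memory=True,     # hold the resulting table in RAM
    )
```

Declaring the schema up front avoids relying on type inference from the first yielded examples, which can matter when early examples are not representative of the whole dataset.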

Outputs

Name    Type     Description
return  Dataset  A cached, Arrow-backed Dataset generated from the generator output.

Usage Examples

Basic Usage

from datasets import Dataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = Dataset.from_generator(gen)
print(ds[0])
# {'text': 'Good', 'label': 0}

Sharded Multiprocessing

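from datasets import Dataset

def gen(shards):
    # With num_proc > 1, each worker receives its own sublist of `shards`.
    for shard in shards:
        with open(shard) as f:
            for line in f:
                yield {"line": line}

# 32 input files; the list is split across the 4 worker processes.
shards = [f"data{i}.txt" for i in range(32)]
ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards}, num_proc=4)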

Related Pages

Implements Principle
