Implementation:Huggingface Datasets Dataset From Generator

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for creating a Dataset from a Python generator function.

Description

Dataset.from_generator is a static method that takes a generator function (a callable that yields one dictionary per example) and produces a cached, Arrow-backed Dataset. The generator is executed through the internal GeneratorDatasetInputStream pipeline, which handles serialization, caching, and optional multiprocessing. When num_proc is greater than 1, list-valued entries in gen_kwargs are treated as shards and automatically split across worker processes for parallel generation.

Usage

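Use Dataset.from_generator when your data is produced lazily, such as when reading from multiple files, streaming from an API, or applying on-the-fly transformations. Because examples are written to the Arrow cache as they are generated, the full dataset does not have to fit in memory, which makes this method well suited to building large custom datasets.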

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: 1123-1203

Signature

@staticmethod
def from_generator(
    generator: Callable,
    features: Optional[Features] = None,
    cache_dir: str = None,
    keep_in_memory: bool = False,
    gen_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    split: NamedSplit = Split.TRAIN,
    fingerprint: Optional[str] = None,
    **kwargs,
):

Import

from datasets import Dataset

I/O Contract

Inputs

Name	Type	Required	Description
generator	`Callable`	Yes	A generator function that yields example dictionaries.
features	`Features`	No	Explicit dataset features schema.
cache_dir	`str`	No	Directory to cache generated data. Defaults to `~/.cache/huggingface/datasets`.
keep_in_memory	`bool`	No	Whether to keep the dataset in memory. Defaults to False.
gen_kwargs	`dict`	No	Keyword arguments passed to the generator function. List-valued entries are treated as shards and split across processes when `num_proc` is greater than 1.
num_proc	`int`	No	Number of processes for parallel generation. Disabled by default.
split	`NamedSplit`	No	Split name assigned to the dataset. Defaults to `Split.TRAIN`.
fingerprint	`str`	No	Custom fingerprint for cache identification. By default derived by hashing the generator and arguments.
**kwargs		No	Additional keyword arguments passed to `GeneratorConfig`.
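
The parameters above can be combined as in the following minimal sketch, which passes an explicit schema through features and scalar arguments through gen_kwargs. The column names, label names, and the helper build_dataset are illustrative assumptions, not part of the library; the datasets import is placed inside the helper only so the generator can be tried on its own.

```python
def gen(start, stop):
    # Scalar (non-list) gen_kwargs reach the generator unchanged.
    for i in range(start, stop):
        yield {"text": f"example {i}", "label": i % 2}


def build_dataset():
    # Hypothetical helper; imports lazily so gen() can run without the library.
    from datasets import ClassLabel, Dataset, Features, Value

    # Explicit schema: a string column and a two-class label column.
    features = Features(
        {"text": Value("string"), "label": ClassLabel(names=["neg", "pos"])}
    )
    return Dataset.from_generator(
        gen,
        features=features,
        gen_kwargs={"start": 0, "stop": 4},
    )


# The generator can be inspected directly before building the dataset.
rows = list(gen(0, 4))
```

Declaring features up front avoids relying on type inference from the first yielded examples, which can matter when a column should be a ClassLabel rather than a plain integer.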

Outputs

Name	Type	Description
return	`Dataset`	A cached, Arrow-backed Dataset generated from the generator output.

Usage Examples

Basic Usage

from datasets import Dataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = Dataset.from_generator(gen)
print(ds[0])
# {'text': 'Good', 'label': 0}

Sharded Multiprocessing

from datasets import Dataset

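def gen(shards):
    # Each worker process receives its own sub-list of `shards`.
    for shard in shards:
        with open(shard, encoding="utf-8") as f:
            for line in f:
                # `line` keeps its trailing newline.
                yield {"line": line}

# The list in gen_kwargs defines the shards that are distributed across workers.
shards = [f"data{i}.txt" for i in range(32)]
ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards}, num_proc=4)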
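
To make the sharding idea concrete, the following pure-Python sketch shows one way list-valued gen_kwargs could be divided among workers. It is an illustration only: split_gen_kwargs is a hypothetical helper, and the library's internal distribution logic may assign shards differently.

```python
from typing import Any


def split_gen_kwargs(gen_kwargs: dict[str, Any], num_workers: int) -> list[dict[str, Any]]:
    """Hypothetical helper: give each worker a contiguous slice of every list-valued entry."""
    list_lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if len(list_lengths) != 1:
        # Assumption for this sketch: all lists share one length, so slicing is unambiguous.
        raise ValueError("expected list-valued entries of one common length")
    num_shards = list_lengths.pop()
    # Never use more workers than there are shards (and at least one).
    num_workers = max(1, min(num_workers, num_shards))
    base, extra = divmod(num_shards, num_workers)

    per_worker = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional shard each.
        stop = start + base + (1 if worker < extra else 0)
        per_worker.append(
            {k: v[start:stop] if isinstance(v, list) else v for k, v in gen_kwargs.items()}
        )
        start = stop
    return per_worker


shards = [f"data{i}.txt" for i in range(32)]
jobs = split_gen_kwargs({"shards": shards}, num_workers=4)
uneven = split_gen_kwargs({"shards": shards[:10], "prefix": "x"}, num_workers=4)
```

With 32 shards and 4 workers, each job in this sketch holds 8 file names; non-list values such as prefix are copied to every worker unchanged.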

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_From_Generator_Construction
