Implementation:Huggingface Datasets Dataset From Generator
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for creating a Dataset from a Python generator function.
Description
Dataset.from_generator is a static method that takes a callable generator function (one that yields dictionaries representing examples) and produces a cached, Arrow-backed Dataset. The generator is executed through the internal GeneratorDatasetInputStream pipeline, which handles serialization, caching, and optional multiprocessing. When num_proc is greater than 1, list-valued entries in gen_kwargs are automatically split across worker processes for parallel generation.
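The splitting of list-valued gen_kwargs entries can be pictured with a small pure-Python sketch. The helper name split_gen_kwargs and the contiguous-slice strategy are assumptions made for illustration only; the library's internal sharding logic may distribute items differently.

```python
from typing import Any


def split_gen_kwargs(gen_kwargs: dict[str, Any], num_proc: int) -> list[dict[str, Any]]:
    """Hypothetical helper: build one gen_kwargs dict per worker.

    Every list value is cut into contiguous slices; non-list values are
    copied unchanged into each worker's dict.
    """
    lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if not lengths:
        # Nothing to shard: a single job receives the kwargs as-is.
        return [dict(gen_kwargs)]
    if len(lengths) > 1:
        raise ValueError("All list-valued gen_kwargs must have the same length.")

    num_shards = lengths.pop()
    # Never create more jobs than there are shards (and always at least one).
    num_jobs = max(1, min(num_proc, num_shards))
    base, extra = divmod(num_shards, num_jobs)

    jobs = []
    start = 0
    for job_id in range(num_jobs):
        # The first `extra` jobs take one additional shard each.
        stop = start + base + (1 if job_id < extra else 0)
        jobs.append(
            {k: v[start:stop] if isinstance(v, list) else v for k, v in gen_kwargs.items()}
        )
        start = stop
    return jobs


shards = [f"data{i}.txt" for i in range(32)]
jobs = split_gen_kwargs({"shards": shards, "encoding": "utf-8"}, num_proc=4)
print([len(job["shards"]) for job in jobs])  # [8, 8, 8, 8]
```

Each worker would then call the generator with its own dictionary, so the generator only needs to iterate over the shards it receives.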
Usage
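Use Dataset.from_generator when your data is produced lazily, such as reading from multiple files, streaming from an API, or performing on-the-fly transformations. Because examples are written to the Arrow cache on disk as they are yielded, the full dataset does not have to fit in memory, which makes this method well suited to large-scale custom dataset creation.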
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: 1123-1203
Signature
@staticmethod
def from_generator(
    generator: Callable,
    features: Optional[Features] = None,
    cache_dir: str = None,
    keep_in_memory: bool = False,
    gen_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    split: NamedSplit = Split.TRAIN,
    fingerprint: Optional[str] = None,
    **kwargs,
):
Import
from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generator | Callable | Yes | A generator function that yields example dictionaries. |
| features | Features | No | Explicit dataset features schema. |
| cache_dir | str | No | Directory to cache generated data. Defaults to ~/.cache/huggingface/datasets. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory. Defaults to False. |
| gen_kwargs | dict | No | Keyword arguments passed to the generator function. Used for sharding with multiprocessing. |
| num_proc | int | No | Number of processes for parallel generation. Disabled by default. |
| split | NamedSplit | No | Split name assigned to the dataset. Defaults to Split.TRAIN. |
| fingerprint | str | No | Custom fingerprint for cache identification. By default derived by hashing the generator and arguments. |
| **kwargs | | No | Additional keyword arguments passed to GeneratorConfig. |
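As a minimal sketch of how the features and gen_kwargs inputs fit together, the example below passes an explicit schema and forwards an argument to the generator. The names gen_reviews, build_dataset, and rows, as well as the label names, are hypothetical and chosen for illustration; the datasets import is kept inside build_dataset so the generator itself can be exercised without the library installed.

```python
from typing import Iterator


def gen_reviews(rows: list[tuple[str, int]]) -> Iterator[dict]:
    """Yield one example dictionary per (text, label) pair."""
    for text, label in rows:
        yield {"text": text, "label": label}


def build_dataset(rows: list[tuple[str, int]]):
    """Build a Dataset with an explicit schema (hypothetical wrapper)."""
    # Imported lazily so gen_reviews stays usable without the library.
    from datasets import ClassLabel, Dataset, Features, Value

    # Declare the schema up front instead of relying on type inference.
    features = Features(
        {
            "text": Value("string"),
            "label": ClassLabel(names=["positive", "negative"]),
        }
    )
    return Dataset.from_generator(
        gen_reviews,
        features=features,
        gen_kwargs={"rows": rows},  # forwarded to gen_reviews as rows=...
    )
```

Calling build_dataset([("Good", 0), ("Bad", 1)]) should return a Dataset whose label column uses the declared ClassLabel feature rather than a plain integer type.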
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A cached, Arrow-backed Dataset generated from the generator output. |
Usage Examples
Basic Usage
from datasets import Dataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = Dataset.from_generator(gen)
print(ds[0])
# {'text': 'Good', 'label': 0}
Sharded Multiprocessing
from datasets import Dataset

def gen(shards):
    for shard in shards:
        with open(shard) as f:
            for line in f:
                yield {"line": line}

shards = [f"data{i}.txt" for i in range(32)]
ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards}, num_proc=4)