Implementation:Huggingface Datasets Generator Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Custom_Data |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A packaged dataset builder in the HuggingFace Datasets library that wraps user-supplied Python generator functions into Arrow-backed datasets.
Description
Generator is a packaged dataset builder extending GeneratorBasedBuilder that allows users to create datasets from arbitrary Python generator functions. It is configured via GeneratorConfig, a dataclass extending BuilderConfig, which accepts a callable generator, optional gen_kwargs (keyword arguments passed to the generator), optional features for schema specification, and a split defaulting to datasets.Split.TRAIN. The generator field is required and validated in __post_init__. The builder produces a single split from the provided generator, and its _generate_examples method supports sharded generation by splitting gen_kwargs across multiple shards and yielding (Key(shard_idx, sample_idx), sample) tuples.
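The sharded generation described above can be illustrated with a minimal, library-free sketch. It is not the code in generator.py: split_gen_kwargs and generate_examples are hypothetical helpers, plain (shard_idx, sample_idx) tuples stand in for Key, and the sharding rule (one shard per element of each list-valued argument) is an assumption made for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


def split_gen_kwargs(gen_kwargs: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: build one kwargs dict per shard.

    Assumption for this sketch: every list-valued argument is shardable and
    contributes one element per shard; all other arguments are copied as-is.
    """
    list_lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if not list_lengths:
        return [dict(gen_kwargs)]  # nothing to shard: a single shard
    if len(list_lengths) > 1:
        raise ValueError("list-valued gen_kwargs must all have the same length")
    num_shards = list_lengths.pop()
    return [
        {k: [v[i]] if isinstance(v, list) else v for k, v in gen_kwargs.items()}
        for i in range(num_shards)
    ]


def generate_examples(
    generator: Callable[..., Iterator[dict]], **gen_kwargs: Any
) -> Iterator[Tuple[Tuple[int, int], dict]]:
    """Yield ((shard_idx, sample_idx), sample) pairs, one shard at a time."""
    for shard_idx, shard_kwargs in enumerate(split_gen_kwargs(gen_kwargs)):
        # The sample index restarts at 0 for each shard.
        for sample_idx, sample in enumerate(generator(**shard_kwargs)):
            yield (shard_idx, sample_idx), sample


def read_words(words: List[str], prefix: str) -> Iterator[dict]:
    """Toy generator: `words` is list-valued (shardable), `prefix` is shared."""
    for group in words:
        for word in group.split():
            yield {"text": prefix + word}


examples = list(generate_examples(read_words, words=["a b", "c"], prefix=">"))
```

Here the two elements of words become two shards, so the keys run (0, 0), (0, 1) for the first shard and (1, 0) for the second, while prefix is passed unchanged to every shard.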
Usage
Generator is used internally by Dataset.from_generator() and is not typically instantiated directly. Users supply a Python generator function and keyword arguments through the from_generator API, which delegates to this builder.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/packaged_modules/generator/generator.py
- Lines: 1-38
Signature
@dataclass
class GeneratorConfig(datasets.BuilderConfig):
    generator: Optional[Callable] = None
    gen_kwargs: Optional[dict] = None
    features: Optional[datasets.Features] = None
    split: datasets.NamedSplit = datasets.Split.TRAIN

class Generator(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = GeneratorConfig
    def _info(self):
    def _split_generators(self, dl_manager):
    def _generate_examples(self, **gen_kwargs):
Import
from datasets.packaged_modules.generator.generator import Generator, GeneratorConfig
I/O Contract
Inputs (GeneratorConfig)
| Name | Type | Required | Description |
|---|---|---|---|
| generator | Optional[Callable] | Yes | A Python callable (generator function) that yields dataset examples. Raises ValueError if not provided. |
| gen_kwargs | Optional[dict] | No | Keyword arguments to pass to the generator function. Defaults to an empty dict. |
| features | Optional[Features] | No | Schema describing the dataset features. If None, features are inferred. |
| split | NamedSplit | No | The split name assigned to the generated data. Defaults to datasets.Split.TRAIN. |
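The input contract above can be mirrored by a small stand-in dataclass. This is a sketch under stated assumptions, not the library class: SketchGeneratorConfig is a hypothetical name, it does not extend BuilderConfig, the string "train" stands in for datasets.Split.TRAIN, and the error message is illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class SketchGeneratorConfig:
    """Hypothetical stand-in for GeneratorConfig (not the library class)."""

    generator: Optional[Callable] = None
    gen_kwargs: Optional[Dict[str, Any]] = None
    features: Optional[Any] = None  # None means the schema is left to inference
    split: str = "train"  # stands in for datasets.Split.TRAIN

    def __post_init__(self) -> None:
        # The field is typed Optional but is effectively required.
        if self.generator is None:
            raise ValueError("generator must be specified")
        # Normalize a missing gen_kwargs to an empty dict.
        if self.gen_kwargs is None:
            self.gen_kwargs = {}


def one_row():
    yield {"id": 0}


config = SketchGeneratorConfig(generator=one_row)

try:
    SketchGeneratorConfig()  # no generator supplied
    missing_generator_rejected = False
except ValueError:
    missing_generator_rejected = True
```

Declaring generator with a None default and checking it in __post_init__ keeps every field keyword-optional at the dataclass level while still rejecting a config that has no generator.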
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | An Arrow-backed Dataset constructed from the yielded examples of the generator function. |
Usage Examples
Basic Usage
from datasets import Dataset

# Define a generator function
def my_generator():
    for i in range(100):
        yield {"id": i, "text": f"Example {i}"}

# Create dataset from generator (internally uses Generator builder)
ds = Dataset.from_generator(my_generator)
print(ds[0])  # {'id': 0, 'text': 'Example 0'}
Generator with Keyword Arguments
from datasets import Dataset

def my_generator(filepath, split):
    with open(filepath) as f:
        for idx, line in enumerate(f):
            yield {"id": idx, "text": line.strip(), "split": split}

ds = Dataset.from_generator(
    my_generator,
    gen_kwargs={"filepath": "data/train.txt", "split": "train"},
)