Implementation:Huggingface Datasets Generator Builder

From Leeroopedia
Knowledge Sources
Domains Data_Loading, Custom_Data
Last Updated 2026-02-14 18:00 GMT

Overview

A packaged dataset builder, provided by the HuggingFace Datasets library, that wraps user-supplied Python generator functions into Arrow-backed datasets.

Description

Generator is a packaged dataset builder extending GeneratorBasedBuilder that lets users create datasets from arbitrary Python generator functions. It is configured via GeneratorConfig, a dataclass extending BuilderConfig, with four fields: generator (the callable that yields examples), gen_kwargs (optional keyword arguments passed to the generator), features (an optional schema), and split (defaults to datasets.Split.TRAIN). Although generator is declared with a default of None, it is effectively required: __post_init__ raises ValueError when it is missing.

The builder produces a single split from the provided generator. Its _generate_examples method supports sharded generation by splitting gen_kwargs into shards, calling the generator once per shard, and yielding (Key(shard_idx, sample_idx), sample) tuples.
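The sharded generation described above can be illustrated with a small standalone sketch. It does not use the library's internal helpers: split_gen_kwargs and generate_examples are hypothetical names, the rule "one shard per element of each list-valued argument" is a simplifying assumption, and plain tuples stand in for Key(shard_idx, sample_idx).

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


def split_gen_kwargs(gen_kwargs: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: one shard per element of the list-valued kwargs."""
    lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if len(lengths) > 1:
        raise ValueError("list-valued gen_kwargs must all have the same length")
    if not lengths or lengths == {0}:
        # Nothing to shard on: treat everything as a single shard.
        return [dict(gen_kwargs)]
    (num_shards,) = lengths
    return [
        {k: [v[i]] if isinstance(v, list) else v for k, v in gen_kwargs.items()}
        for i in range(num_shards)
    ]


def generate_examples(
    generator: Callable[..., Iterator[dict]], **gen_kwargs: Any
) -> Iterator[Tuple[Tuple[int, int], dict]]:
    """Call the generator once per shard and key each sample by (shard, index)."""
    for shard_idx, shard_kwargs in enumerate(split_gen_kwargs(gen_kwargs)):
        for sample_idx, sample in enumerate(generator(**shard_kwargs)):
            yield (shard_idx, sample_idx), sample


def rows_from_files(files: List[str], tag: str) -> Iterator[dict]:
    """Toy generator: pretends each file name contributes two rows."""
    for name in files:
        for row in range(2):
            yield {"file": name, "row": row, "tag": tag}


examples = list(generate_examples(rows_from_files, files=["a", "b"], tag="x"))
keys = [key for key, _ in examples]
```

Here the list-valued files argument drives the sharding while the scalar tag is copied unchanged into every shard, so sample indices restart at zero within each shard.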

Usage

Generator is used internally by Dataset.from_generator() and is not typically instantiated directly. Users supply a Python generator function and keyword arguments through the from_generator API, which delegates to this builder.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/packaged_modules/generator/generator.py
  • Lines: 1-38

Signature

from dataclasses import dataclass
from typing import Callable, Optional

import datasets


@dataclass
class GeneratorConfig(datasets.BuilderConfig):
    generator: Optional[Callable] = None  # required; checked in __post_init__
    gen_kwargs: Optional[dict] = None
    features: Optional[datasets.Features] = None
    split: datasets.NamedSplit = datasets.Split.TRAIN


class Generator(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = GeneratorConfig

    # Method bodies are elided here; only the signatures are listed.
    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _generate_examples(self, **gen_kwargs): ...

Import

from datasets.packaged_modules.generator.generator import Generator, GeneratorConfig

I/O Contract

Inputs (GeneratorConfig)

Name | Type | Required | Description
generator | Optional[Callable] | Yes | A Python callable (generator function) that yields dataset examples. Raises ValueError if not provided.
gen_kwargs | Optional[dict] | No | Keyword arguments passed to the generator function. Defaults to None, which is normalized to an empty dict.
features | Optional[Features] | No | Schema describing the dataset features. If None, features are inferred.
split | NamedSplit | No | The split name assigned to the generated data. Defaults to datasets.Split.TRAIN.
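The validation behaviour listed in the table can be sketched with a standalone dataclass. SketchGeneratorConfig is a hypothetical stand-in, not the library class: it omits the BuilderConfig base and the features field, uses a plain string for the split, and its error message is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchGeneratorConfig:
    """Hypothetical stand-in for GeneratorConfig (not the library class)."""

    generator: Optional[Callable] = None
    gen_kwargs: Optional[dict] = None
    split: str = "train"  # assumption: plain string instead of NamedSplit

    def __post_init__(self):
        # The field has a default of None, but a missing generator is rejected.
        if self.generator is None:
            raise ValueError("generator must be specified")
        # A missing gen_kwargs is normalized to an empty dict.
        if self.gen_kwargs is None:
            self.gen_kwargs = {}


def count_up(n=3):
    """Toy generator yielding n small examples."""
    for i in range(n):
        yield {"id": i}


config = SketchGeneratorConfig(generator=count_up)
rows = list(config.generator(**config.gen_kwargs))

try:
    SketchGeneratorConfig()
    missing_generator_rejected = False
except ValueError:
    missing_generator_rejected = True
```

Normalizing gen_kwargs in __post_init__ means downstream code can always unpack it with ** without checking for None first.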

Outputs

Name | Type | Description
dataset | Dataset | An Arrow-backed Dataset constructed from the examples yielded by the generator function.

Usage Examples

Basic Usage

from datasets import Dataset

# Define a generator function
def my_generator():
    for i in range(100):
        yield {"id": i, "text": f"Example {i}"}

# Create dataset from generator (internally uses Generator builder)
ds = Dataset.from_generator(my_generator)
print(ds[0])  # {'id': 0, 'text': 'Example 0'}

Generator with Keyword Arguments

from datasets import Dataset

def my_generator(filepath, split):
    with open(filepath) as f:
        for idx, line in enumerate(f):
            yield {"id": idx, "text": line.strip(), "split": split}

ds = Dataset.from_generator(
    my_generator,
    gen_kwargs={"filepath": "data/train.txt", "split": "train"},
)

Related Pages

Implements Principle

Requires Environment
