Implementation:Huggingface Datasets Load Dataset Builder

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool for resolving a dataset identifier to a configured DatasetBuilder instance, provided by the HuggingFace Datasets library.

Description

load_dataset_builder takes a dataset path (Hub identifier, local directory, or packaged module name) along with optional configuration parameters and returns a fully configured DatasetBuilder instance. This builder can then be used to inspect dataset metadata via .info, download and prepare data via .download_and_prepare(), or create a streaming dataset via .as_streaming_dataset(). The function handles module factory resolution, builder class lookup, parameter merging, packaged module validation, and builder instantiation.

Usage

Use load_dataset_builder when you want to obtain a DatasetBuilder object for inspection or step-by-step dataset preparation, rather than directly loading data into a Dataset object. This is the function called internally by load_dataset before triggering download_and_prepare.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/load.py
  • Lines: 1034-1181

Signature

def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:

Import

from datasets import load_dataset_builder

I/O Contract

Inputs

Name | Type | Required | Description
path | str | Yes | Path or name of the dataset. Can be a Hub repository (e.g. 'username/dataset_name'), a local directory, or a packaged builder name (e.g. 'csv', 'parquet').
name | Optional[str] | No | Name of the dataset configuration.
data_dir | Optional[str] | No | Directory containing the data files. For generic builders (csv, json, etc.) with no data_files given, equivalent to passing os.path.join(data_dir, "**") as data_files.
data_files | Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] | No | Path(s) to source data file(s).
cache_dir | Optional[str] | No | Directory to read/write data. Defaults to "~/.cache/huggingface/datasets".
features | Optional[Features] | No | Set the features type to use for this dataset.
download_config | Optional[DownloadConfig] | No | Specific download configuration parameters.
download_mode | Optional[Union[DownloadMode, str]] | No | Download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS.
revision | Optional[Union[str, Version]] | No | Version of the dataset to load (commit SHA, git tag, or branch).
token | Optional[Union[bool, str]] | No | Bearer token for Hub authentication. If True, reads the token from ~/.huggingface.
storage_options | Optional[dict] | No | Key/value pairs passed to the dataset file-system backend. Experimental, added in v2.11.0.
**config_kwargs | keyword arguments | No | Additional keyword arguments passed to BuilderConfig and DatasetBuilder.
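
To make the data_dir row more concrete, here is a rough illustration of which files a `<data_dir>/**` pattern picks up. It uses the standard-library glob module as a stand-in; the library resolves patterns with its own logic, so treat this only as an approximation. The helper name files_matched_by_data_dir and the sample file layout are hypothetical.

```python
import glob
import os
import tempfile


def files_matched_by_data_dir(data_dir):
    """Hypothetical helper: list files matched by the `<data_dir>/**` pattern."""
    pattern = os.path.join(data_dir, "**")
    # recursive=True lets "**" descend into subdirectories
    matches = glob.glob(pattern, recursive=True)
    # Keep regular files only and report them relative to data_dir
    return sorted(os.path.relpath(p, data_dir) for p in matches if os.path.isfile(p))


with tempfile.TemporaryDirectory() as data_dir:
    # Made-up layout: one file at the top level, one in a subdirectory
    os.makedirs(os.path.join(data_dir, "train"))
    for rel in (os.path.join("train", "part-0.csv"), "test.csv"):
        with open(os.path.join(data_dir, rel), "w", encoding="utf-8") as f:
            f.write("text,label\nhello,1\n")

    matched = files_matched_by_data_dir(data_dir)

print(matched)
```

The point of the sketch is that files in nested subdirectories are matched as well as files directly under data_dir.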

Outputs

Name | Type | Description
builder | DatasetBuilder | A fully configured DatasetBuilder instance ready for metadata inspection or data preparation.

Usage Examples

Basic Usage

from datasets import load_dataset_builder

# Get a builder for the Rotten Tomatoes dataset
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Inspect features
print(ds_builder.info.features)
# {'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}

Step-by-Step Download and Prepare

from datasets import load_dataset_builder

ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Download and prepare the dataset
ds_builder.download_and_prepare()

# Access the prepared dataset
ds = ds_builder.as_dataset()

With a Packaged Builder

from datasets import load_dataset_builder

# Use the CSV builder with specific data files
ds_builder = load_dataset_builder(
    "csv",
    data_files={"train": "train.csv", "test": "test.csv"},
)
print(ds_builder.info)

Related Pages

Implements Principle
