Implementation:Huggingface Datasets Load Dataset Builder

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool for resolving a dataset identifier to a configured DatasetBuilder instance, provided by the HuggingFace Datasets library.

Description

load_dataset_builder takes a dataset path (Hub identifier, local directory, or packaged module name) along with optional configuration parameters and returns a fully configured DatasetBuilder instance. This builder can then be used to inspect dataset metadata via .info, download and prepare data via .download_and_prepare(), or create a streaming dataset via .as_streaming_dataset(). The function handles module factory resolution, builder class lookup, parameter merging, packaged module validation, and builder instantiation.
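
The stages listed above (resolution, builder class lookup, parameter merging, packaged module validation, instantiation) can be pictured with a toy model. This is only a sketch under stated assumptions: `ToyBuilder`, `PACKAGED`, and `toy_load_builder` are hypothetical names invented for illustration and are not part of the datasets library, whose real implementation is considerably more involved.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple, Type


@dataclass
class ToyBuilder:
    """Hypothetical stand-in for a DatasetBuilder (not the real class)."""
    config_name: str = "default"
    data_files: Optional[dict] = None
    config_kwargs: Dict[str, Any] = field(default_factory=dict)


class ToyCsvBuilder(ToyBuilder):
    pass


class ToyJsonBuilder(ToyBuilder):
    pass


# Hypothetical registry: packaged name -> (builder class, default kwargs).
PACKAGED: Dict[str, Tuple[Type[ToyBuilder], Dict[str, Any]]] = {
    "csv": (ToyCsvBuilder, {"sep": ","}),
    "json": (ToyJsonBuilder, {}),
}


def toy_load_builder(path, name=None, data_files=None, **config_kwargs):
    # 1. Resolution and class lookup: map the identifier to a builder class.
    if path not in PACKAGED:
        raise FileNotFoundError(f"Unknown toy builder: {path!r}")
    builder_cls, default_kwargs = PACKAGED[path]
    # 2. Parameter merging: caller-supplied kwargs override the defaults.
    merged = {**default_kwargs, **config_kwargs}
    # 3. Validation: a packaged builder has no data of its own to read.
    if data_files is None:
        raise ValueError(f"Please specify data files for the {path!r} builder.")
    # 4. Instantiation: return a configured builder; no data is read yet.
    return builder_cls(
        config_name=name or "default",
        data_files=data_files,
        config_kwargs=merged,
    )


builder = toy_load_builder("csv", data_files={"train": ["train.csv"]}, sep=";")
print(type(builder).__name__, builder.config_kwargs)
```

The point of the sketch is the ordering: nothing is downloaded or parsed at this stage, so the returned object is cheap to create and safe to inspect.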

Usage

Use load_dataset_builder when you want a DatasetBuilder object for inspection or step-by-step dataset preparation, rather than loading data directly into a Dataset object. load_dataset calls this function internally to obtain the builder, then calls download_and_prepare on it.

Code Reference

Source Location

Repository: datasets
File: src/datasets/load.py
Lines: 1034-1181

Signature

def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:

Import

from datasets import load_dataset_builder

I/O Contract

Inputs

Name	Type	Required	Description
path	`str`	Yes	Path or name of the dataset. Can be a Hub repository (e.g. `'username/dataset_name'`), a local directory, or a packaged builder name (e.g. `'csv'`, `'parquet'`).
name	`Optional[str]`	No	Name of the dataset configuration.
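data_dir	`Optional[str]`	No	Directory containing the data files. For generic builders (e.g. `csv`, `text`) when `data_files` is not given, equivalent to passing `os.path.join(data_dir, "**")` as `data_files`, i.e. every file under the directory.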
data_files	`Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]]`	No	Path(s) to source data file(s).
cache_dir	`Optional[str]`	No	Directory to read/write data. Defaults to `"~/.cache/huggingface/datasets"`.
features	`Optional[Features]`	No	Set the features type to use for this dataset.
download_config	`Optional[DownloadConfig]`	No	Specific download configuration parameters.
download_mode	`Optional[Union[DownloadMode, str]]`	No	Download/generate mode. Defaults to `REUSE_DATASET_IF_EXISTS`.
revision	`Optional[Union[str, Version]]`	No	Version of the dataset to load (commit SHA, git tag, or branch).
token	`Optional[Union[bool, str]]`	No	Bearer token for Hub authentication. If `True`, reads from `~/.huggingface`.
storage_options	`Optional[dict]`	No	Key/value pairs passed to the dataset file-system backend. Experimental, added in v2.11.0.
**config_kwargs	keyword arguments	No	Additional keyword arguments passed to `BuilderConfig` and `DatasetBuilder`.
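
To make the `data_dir` row concrete, the sketch below shows what a recursive `**` pattern matches, using only the standard library. `expand_data_dir` is a hypothetical helper written for this page. The library resolves patterns through its own file-system layer, so treat this as an approximation of the idea rather than its actual code path.

```python
import glob
import os
import tempfile


def expand_data_dir(data_dir: str) -> list:
    """List every regular file under data_dir, relative to data_dir."""
    pattern = os.path.join(data_dir, "**")
    # recursive=True lets "**" descend into subdirectories.
    matches = glob.glob(pattern, recursive=True)
    # Keep files only; the pattern also matches directories.
    return sorted(os.path.relpath(p, data_dir) for p in matches if os.path.isfile(p))


# Demo on a throwaway directory with one nested file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ("train.csv", os.path.join("nested", "test.csv")):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("text,label\n")
    print(expand_data_dir(tmp))
```

Because nested files are matched as well, pass `data_files` explicitly when only a subset of a directory should be loaded.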

Outputs

Name	Type	Description
builder	`DatasetBuilder`	A fully configured `DatasetBuilder` instance ready for metadata inspection or data preparation.

Usage Examples

Basic Usage

from datasets import load_dataset_builder

# Get a builder for the Rotten Tomatoes dataset
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Inspect features
print(ds_builder.info.features)
# {'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}

Step-by-Step Download and Prepare

from datasets import load_dataset_builder

ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# Download and prepare the dataset
ds_builder.download_and_prepare()

# Access the prepared dataset
ds = ds_builder.as_dataset()

With a Packaged Builder

from datasets import load_dataset_builder

# Use the CSV builder with specific data files
ds_builder = load_dataset_builder(
    "csv",
    data_files={"train": "train.csv", "test": "test.csv"},
)
print(ds_builder.info)
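
The `data_files` argument accepts a single path, a sequence of paths, or a mapping from split name to path(s), as in the example above. The sketch below shows one way these three shapes can be reduced to a single split-to-list mapping. `normalize_data_files` is a hypothetical helper, not a library function, and it assumes that files given without a split name belong to a `train` split.

```python
from typing import Dict, List, Mapping, Sequence, Union

DataFiles = Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]


def normalize_data_files(
    data_files: DataFiles, default_split: str = "train"
) -> Dict[str, List[str]]:
    """Reduce any accepted data_files shape to {split: [paths]}."""
    # A bare string is a single file for the default split (assumption).
    if isinstance(data_files, str):
        return {default_split: [data_files]}
    # A mapping already names its splits; wrap lone strings in a list.
    if isinstance(data_files, Mapping):
        return {
            str(split): [files] if isinstance(files, str) else list(files)
            for split, files in data_files.items()
        }
    # Any other sequence is a list of files for the default split.
    return {default_split: list(data_files)}


print(normalize_data_files({"train": "train.csv", "test": "test.csv"}))
```

Checking for `str` first matters because a string is itself a sequence; without that branch, a single path would be split into characters.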

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Builder_Resolution
