Implementation:Huggingface Datasets Load Dataset Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool for resolving a dataset identifier to a configured DatasetBuilder instance, provided by the HuggingFace Datasets library.
Description
load_dataset_builder takes a dataset path (Hub identifier, local directory, or packaged module name) along with optional configuration parameters and returns a fully configured DatasetBuilder instance. This builder can then be used to inspect dataset metadata via .info, download and prepare data via .download_and_prepare(), or create a streaming dataset via .as_streaming_dataset(). The function handles module factory resolution, builder class lookup, parameter merging, packaged module validation, and builder instantiation.
Usage
Use load_dataset_builder when you want to obtain a DatasetBuilder object for inspection or step-by-step dataset preparation, rather than directly loading data into a Dataset object. This is the function called internally by load_dataset before triggering download_and_prepare.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/load.py
- Lines: 1034-1181
Signature
def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:
Import
from datasets import load_dataset_builder
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path or name of the dataset. Can be a Hub repository (e.g. 'username/dataset_name'), a local directory, or a packaged builder name (e.g. 'csv', 'parquet'). |
| name | Optional[str] | No | Name of the dataset configuration. |
| data_dir | Optional[str] | No | Directory containing the data files. For generic builders, equivalent to passing os.path.join(data_dir, "**") as data_files. |
| data_files | Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] | No | Path(s) to source data file(s). |
| cache_dir | Optional[str] | No | Directory to read/write data. Defaults to "~/.cache/huggingface/datasets". |
| features | Optional[Features] | No | Features type to use for this dataset. |
| download_config | Optional[DownloadConfig] | No | Specific download configuration parameters. |
| download_mode | Optional[Union[DownloadMode, str]] | No | Download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS. |
| revision | Optional[Union[str, Version]] | No | Version of the dataset to load (commit SHA, git tag, or branch). |
| token | Optional[Union[bool, str]] | No | Bearer token for Hub authentication. If True, the token is read from ~/.huggingface. |
| storage_options | Optional[dict] | No | Key/value pairs passed to the dataset file-system backend. Experimental, added in v2.11.0. |
| **config_kwargs | keyword arguments | No | Additional keyword arguments passed to BuilderConfig and DatasetBuilder. |
Outputs
| Name | Type | Description |
|---|---|---|
| builder | DatasetBuilder | A fully configured DatasetBuilder instance ready for metadata inspection or data preparation. |
Usage Examples
Basic Usage
from datasets import load_dataset_builder
# Get a builder for the Rotten Tomatoes dataset
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
# Inspect features
print(ds_builder.info.features)
# {'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}
Step-by-Step Download and Prepare
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
# Download and prepare the dataset
ds_builder.download_and_prepare()
# Access the prepared dataset
ds = ds_builder.as_dataset()
With a Packaged Builder
from datasets import load_dataset_builder
# Use the CSV builder with specific data files
ds_builder = load_dataset_builder(
    "csv",
    data_files={"train": "train.csv", "test": "test.csv"},
)
print(ds_builder.info)