Implementation:Huggingface Datasets DatasetBuilder As Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the Hugging Face Datasets library, for constructing Dataset objects from cached Arrow files.
Description
DatasetBuilder.as_dataset constructs Dataset or DatasetDict objects from previously prepared on-disk data. It verifies that the prepared data exists, resolves the requested split(s), delegates to _as_dataset (which uses ArrowReader to read and concatenate Arrow IPC shard files), computes a deterministic fingerprint for caching, and optionally runs post-processing steps defined by the dataset builder. When no split is specified, it returns all available splits as a DatasetDict. The method requires that data was prepared in Arrow format (not Parquet) and that the data is on local storage (not a remote filesystem).
Usage
Call as_dataset() after download_and_prepare() to obtain the final Dataset object. This is the second step in the two-phase builder pattern: prepare then load. In the common load_dataset() workflow, both phases are handled automatically, but the explicit builder API allows fine-grained control.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/builder.py
- Lines: L992-L1134
Signature
def as_dataset(
    self,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    run_post_process=True,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    in_memory=False,
) -> Union[Dataset, DatasetDict]:
Import
from datasets import load_dataset_builder
# Access via builder instance:
builder = load_dataset_builder("dataset_name")
builder.download_and_prepare()
ds = builder.as_dataset(split="train")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| split | str, Split, list[str], or list[Split] | No | Which subset(s) of the data to return. Supports split names ("train"), combinations ("train+test"), and slicing ("train[:10%]"). If None, returns all splits as a DatasetDict. |
| run_post_process | bool | No | Whether to run post-processing transforms and/or add indexes. Defaults to True. |
| verification_mode | VerificationMode or str | No | Checks to run on dataset information. The signature default is None, which is treated as BASIC_CHECKS. |
| in_memory | bool | No | Whether to copy the data into memory rather than memory-mapping from disk. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
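| (return value) | Dataset or DatasetDict | A single Dataset when a specific split is requested (including combined or sliced expressions such as "train+test"), or a DatasetDict mapping split names to Dataset objects when split is None. A list of splits yields a list of Dataset objects in the same order, although the return annotation does not list that case. |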
Usage Examples
Basic Usage
from datasets import load_dataset_builder
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()
# Load a single split
ds = builder.as_dataset(split="train")
print(ds)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 8530
# })
Loading All Splits
from datasets import load_dataset_builder
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()
# Load all splits as a DatasetDict
ds_dict = builder.as_dataset()
print(ds_dict)
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 8530 })
#     validation: Dataset({ features: ['text', 'label'], num_rows: 1066 })
#     test: Dataset({ features: ['text', 'label'], num_rows: 1066 })
# })
In-Memory Loading
from datasets import load_dataset_builder
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()
# Load data into memory for maximum access speed
ds = builder.as_dataset(split="train", in_memory=True)