Implementation:Huggingface Datasets DatasetBuilder As Dataset

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for constructing Dataset objects from cached Arrow files. By default the data is memory-mapped from disk; it is copied into RAM only when in_memory=True.

Description

DatasetBuilder.as_dataset constructs Dataset or DatasetDict objects from previously prepared on-disk data. It verifies that the prepared data exists, resolves the requested split(s), delegates to _as_dataset (which uses ArrowReader to read and concatenate Arrow IPC shard files), computes a deterministic fingerprint for caching, and optionally runs post-processing steps defined by the dataset builder. When no split is specified, it returns all available splits as a DatasetDict. The method requires that data was prepared in Arrow format (not Parquet) and that the data is on local storage (not a remote filesystem).
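The control flow described above can be modelled in a few lines of plain Python. This is a simplified sketch, not the library's code: `as_dataset_sketch`, `prepared_splits`, and `post_process` are hypothetical names, lists of integers stand in for Arrow-backed Dataset objects, and the fingerprinting step is left out.

```python
from typing import Callable, Optional, Union

Rows = list[int]  # stand-in for a Dataset in this toy model


def as_dataset_sketch(
    prepared_splits: dict[str, Rows],
    split: Optional[str] = None,
    post_process: Optional[Callable[[Rows], Rows]] = None,
) -> Union[Rows, dict[str, Rows]]:
    """Toy model of the prepare-check, split resolution, and post-processing steps."""
    # Refuse to load if nothing has been prepared yet.
    if not prepared_splits:
        raise FileNotFoundError("no prepared data; run the prepare step first")

    def build_one(name: str) -> Rows:
        # Read the data for one split (here: copy a list).
        if name not in prepared_splits:
            raise ValueError(f"unknown split: {name!r}")
        rows = list(prepared_splits[name])
        # Apply the optional post-processing hook.
        return post_process(rows) if post_process else rows

    # No split requested means "every split", keyed by split name.
    if split is None:
        return {name: build_one(name) for name in prepared_splits}
    return build_one(split)


prepared = {"train": [1, 2, 3], "test": [4]}
print(as_dataset_sketch(prepared, "train"))
print(as_dataset_sketch(prepared))
```

The dictionary branch mirrors the DatasetDict case; the single-split branch mirrors the Dataset case.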

Usage

Call as_dataset() after download_and_prepare() to obtain the final Dataset object. This is the second step in the two-phase builder pattern: prepare then load. In the common load_dataset() workflow, both phases are handled automatically, but the explicit builder API allows fine-grained control.

Code Reference

Source Location

Repository: datasets
File: src/datasets/builder.py
Lines: L992-L1134

Signature

def as_dataset(
    self,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    run_post_process=True,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    in_memory=False,
) -> Union[Dataset, DatasetDict]:

Import

from datasets import load_dataset_builder
# Access via builder instance:
builder = load_dataset_builder("dataset_name")
builder.download_and_prepare()
ds = builder.as_dataset(split="train")

I/O Contract

Inputs

Name	Type	Required	Description
split	`str`, `Split`, `list[str]`, or `list[Split]`	No	Which subset(s) of the data to return. Supports split names (`"train"`), combinations (`"train+test"`), and slicing (`"train[:10%]"`). If `None`, returns all splits as a `DatasetDict`.
run_post_process	`bool`	No	Whether to run post-processing transforms and/or add indexes. Defaults to `True`.
verification_mode	`VerificationMode` or `str`	No	Checks to run on dataset information. The signature default is `None`, which is treated as `BASIC_CHECKS`.
in_memory	`bool`	No	Whether to copy the data into memory rather than memory-mapping from disk. Defaults to `False`.

Outputs

Name	Type	Description
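(return value)	`Dataset` or `DatasetDict`	A single `Dataset` when one split expression is requested (including combinations such as `"train+test"` and slices such as `"train[:10%]"`), or a `DatasetDict` mapping split names to `Dataset` objects when `split` is `None`. When `split` is a list, the request is mapped over its elements, so the result is a list with one `Dataset` per entry rather than a `DatasetDict`.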

Usage Examples

Basic Usage

from datasets import load_dataset_builder

builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

# Load a single split
ds = builder.as_dataset(split="train")
print(ds)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 8530
# })

Loading All Splits

from datasets import load_dataset_builder

builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

# Load all splits as a DatasetDict
ds_dict = builder.as_dataset()
print(ds_dict)
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 8530 })
#     validation: Dataset({ features: ['text', 'label'], num_rows: 1066 })
#     test: Dataset({ features: ['text', 'label'], num_rows: 1066 })
# })

In-Memory Loading

from datasets import load_dataset_builder

builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

# Copy the data into RAM instead of memory-mapping the Arrow files from disk
ds = builder.as_dataset(split="train", in_memory=True)

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Object_Construction
