Implementation:Huggingface Datasets DatasetBuilder As Dataset

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for constructing Dataset objects from cached Arrow files. By default the data is memory-mapped from disk; it is copied into memory only when in_memory=True.

Description

DatasetBuilder.as_dataset constructs Dataset or DatasetDict objects from previously prepared on-disk data. It verifies that the prepared data exists, resolves the requested split(s), delegates to _as_dataset (which uses ArrowReader to read and concatenate Arrow IPC shard files), computes a deterministic fingerprint for caching, and optionally runs post-processing steps defined by the dataset builder. When no split is specified, it returns all available splits as a DatasetDict. The method requires that data was prepared in Arrow format (not Parquet) and that the data is on local storage (not a remote filesystem).
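
The control flow described above can be modeled with a small stand-in class. This is a simplified sketch, not the library's source: ToyBuilder, its prepared attribute, and the plain lists standing in for datasets are all hypothetical, and Arrow reading, fingerprinting, and post-processing are left out.

```python
class ToyBuilder:
    """Hypothetical stand-in that mimics the prepare-then-load contract."""

    def __init__(self, raw_splits):
        self._raw_splits = raw_splits  # split name -> list of rows
        self.prepared = None           # set by download_and_prepare()

    def download_and_prepare(self):
        # Phase 1: materialize every split (the real builder writes Arrow files).
        self.prepared = {name: list(rows) for name, rows in self._raw_splits.items()}

    def as_dataset(self, split=None):
        # Phase 2, step 1: refuse to load data that was never prepared.
        if self.prepared is None:
            raise FileNotFoundError("call download_and_prepare() first")
        # Step 2: no split requested, so return every split keyed by name.
        if split is None:
            return dict(self.prepared)
        # Step 3: a single named split returns just that split.
        return self.prepared[split]


toy = ToyBuilder({"train": [1, 2, 3], "test": [4]})
toy.download_and_prepare()
all_splits = toy.as_dataset()               # dictionary of all splits
train_only = toy.as_dataset(split="train")  # one split only
```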

Usage

Call as_dataset() after download_and_prepare() to obtain the final Dataset object. This is the second step in the two-phase builder pattern: prepare then load. In the common load_dataset() workflow, both phases are handled automatically, but the explicit builder API allows fine-grained control.
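
When it is not known whether the cache is already populated, the two phases can be combined in a small helper. The following is a minimal sketch: load_prepared is a hypothetical name, and the code assumes that as_dataset() signals missing prepared data with FileNotFoundError, which should be checked against the installed version of the library.

```python
def load_prepared(builder, split=None, **as_dataset_kwargs):
    """Return builder.as_dataset(...), preparing the data first if needed.

    `builder` is expected to expose download_and_prepare() and as_dataset(),
    as a DatasetBuilder does.
    """
    try:
        return builder.as_dataset(split=split, **as_dataset_kwargs)
    except FileNotFoundError:
        # Assumption: a missing cache surfaces as FileNotFoundError.
        builder.download_and_prepare()
        return builder.as_dataset(split=split, **as_dataset_kwargs)
```

With a real builder, this could be called as load_prepared(load_dataset_builder("cornell-movie-review-data/rotten_tomatoes"), split="train").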

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/builder.py
  • Lines: L992-L1134

Signature

def as_dataset(
    self,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    run_post_process=True,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    in_memory=False,
) -> Union[Dataset, DatasetDict]:

Import

from datasets import load_dataset_builder
# Access via builder instance:
builder = load_dataset_builder("dataset_name")
builder.download_and_prepare()
ds = builder.as_dataset(split="train")

I/O Contract

Inputs

Name | Type | Required | Description
split | str, Split, list[str], or list[Split] | No | Which subset(s) of the data to return. Supports split names ("train"), combinations ("train+test"), and slicing ("train[:10%]"). If None, returns all splits as a DatasetDict.
run_post_process | bool | No | Whether to run post-processing transforms and/or add indexes. Defaults to True.
verification_mode | VerificationMode or str | No | Checks to run on dataset information. Defaults to BASIC_CHECKS.
in_memory | bool | No | Whether to copy the data into memory rather than memory-mapping from disk. Defaults to False.
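
To compare what the different split specifications select, each one can be passed to as_dataset() in turn. This is an illustrative sketch: rows_per_spec and SPECS are hypothetical names, and the builder is assumed to have been prepared already. No row counts are shown because they depend on the dataset and on the library's slicing rules.

```python
def rows_per_spec(builder, specs):
    """Map each split specification to the number of rows it selects."""
    return {spec: builder.as_dataset(split=spec).num_rows for spec in specs}


# Specification forms listed for the `split` parameter:
# a plain name, a combination, and a percentage slice.
SPECS = ["train", "train+test", "train[:10%]"]
```

With a prepared builder, rows_per_spec(builder, SPECS) returns a dictionary keyed by the specification strings.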

Outputs

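Name | Type | Description
(return value) | Dataset or DatasetDict | A single Dataset when one split specification is requested, or a DatasetDict mapping split names to Dataset objects when split is None. When split is a list, the result follows the structure of that list (one Dataset per entry) rather than being a DatasetDict.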
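
Because the return type depends on the split argument, calling code sometimes normalizes it to a single shape. The following is a minimal sketch under stated assumptions: splits_as_dict is a hypothetical helper, it relies on DatasetDict being a dict subclass, and it treats a list result as an assumption about how list-valued split arguments behave.

```python
def splits_as_dict(result, requested=None):
    """Normalize an as_dataset() result to a {name: dataset} dictionary."""
    if isinstance(result, dict):
        # DatasetDict is a dict subclass, so this covers the split=None case.
        return dict(result)
    if isinstance(result, (list, tuple)):
        # Assumption: a list of splits yields results in the same order.
        if requested is None or len(requested) != len(result):
            raise ValueError("pass the requested split names in matching order")
        return {str(name): ds for name, ds in zip(requested, result)}
    # A single Dataset: label it with the requested split, if one was given.
    return {str(requested) if requested is not None else "dataset": result}
```

For example, splits_as_dict(builder.as_dataset(split="train"), "train") yields a one-entry dictionary keyed by "train".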

Usage Examples

Basic Usage

from datasets import load_dataset_builder

builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

# Load a single split
ds = builder.as_dataset(split="train")
print(ds)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 8530
# })

Loading All Splits

from datasets import load_dataset_builder

builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

# Load all splits as a DatasetDict
ds_dict = builder.as_dataset()
print(ds_dict)
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 8530 })
#     validation: Dataset({ features: ['text', 'label'], num_rows: 1066 })
#     test: Dataset({ features: ['text', 'label'], num_rows: 1066 })
# })

In-Memory Loading

from datasets import load_dataset_builder

builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()

# Copy the data into RAM instead of memory-mapping the Arrow files from disk
ds = builder.as_dataset(split="train", in_memory=True)
