
Implementation:Huggingface Datatrove HuggingFaceDatasetReader

From Leeroopedia
Domains Data_Ingestion, NLP_Data_Processing, ML_Infrastructure
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool from the datatrove library for reading HuggingFace datasets. HuggingFaceDatasetReader extends BaseReader to load datasets from the HuggingFace Hub (or from disk) and convert each row into a Document object for pipeline processing.

Description

HuggingFaceDatasetReader is a pipeline reader component that integrates the HuggingFace datasets library into the datatrove pipeline framework. It loads a named dataset from the Hub (or a local dataset saved with Dataset.save_to_disk()) and iterates over its rows, converting each row into a Document object with text content, an identifier, and metadata.

Key capabilities include:

  • Hub integration: Loads any dataset from the HuggingFace Hub by name (e.g., "wikipedia", "allenai/c4")
  • Streaming support: When streaming=True, data is fetched on-demand without downloading the entire dataset
  • Batched iteration: Reads rows in configurable batches (default 1000) to optimize I/O throughput
  • Dataset options passthrough: The dataset_options dictionary allows specifying split, configuration, revision, and other load_dataset parameters (see the sketch after this list)
  • Local disk loading: When load_from_disk=True, reads datasets previously saved with Dataset.save_to_disk() instead of downloading from the Hub
  • Configurable field mapping: The text_key and id_key parameters control which dataset columns map to Document fields
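
As a brief illustration of the dataset_options passthrough, the sketch below forwards a configuration name, split, and pinned revision to load_dataset(); the dataset and revision values are illustrative, not recommendations.

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Every key in dataset_options is forwarded to datasets.load_dataset();
# "revision" pins a Hub branch or commit (the value here is illustrative).
reader = HuggingFaceDatasetReader(
    dataset="allenai/c4",
    dataset_options={"name": "en", "split": "validation", "revision": "main"},
    streaming=True,
)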

Unlike the disk-based readers (WarcReader, JsonlReader), HuggingFaceDatasetReader extends BaseReader rather than BaseDiskReader, since it delegates file handling entirely to the datasets library.

Usage

Use HuggingFaceDatasetReader when a HuggingFace Hub dataset serves as the pipeline input for tokenization, filtering, deduplication, or inference tasks. It is the preferred reader for datasets already hosted on the Hub.

Code Reference

Source Location

  • Repository: datatrove
  • File: src/datatrove/pipeline/readers/huggingface.py
  • Lines: L11-144 (HuggingFaceDatasetReader class and methods)

Signature

class HuggingFaceDatasetReader(BaseReader):
    def __init__(
        self,
        dataset: str,
        dataset_options: dict | None = None,
        streaming: bool = False,
        limit: int = -1,
        skip: int = 0,
        batch_size: int = 1000,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        shuffle_files: bool = False,
        load_from_disk: bool = False,
    ):

Import

from datatrove.pipeline.readers import HuggingFaceDatasetReader

I/O Contract

Inputs

Name Type Required Description
dataset str Yes HuggingFace dataset name (e.g., "wikipedia", "allenai/c4") or path to a local dataset directory
dataset_options dict No (default: None) Additional keyword arguments passed to load_dataset(), such as split, name (configuration), revision, data_dir
streaming bool No (default: False) Stream data from the Hub without downloading the full dataset
limit int No (default: -1) Maximum number of documents to read; -1 for unlimited
skip int No (default: 0) Number of documents to skip from the beginning
batch_size int No (default: 1000) Number of rows to read per batch for I/O optimization
doc_progress bool No (default: False) Show progress bar for documents processed
adapter Callable No (default: None) Custom function to transform raw row data before Document creation
text_key str No (default: "text") Dataset column name containing the document text content
id_key str No (default: "id") Dataset column name containing the document identifier
default_metadata dict No (default: None) Default metadata to attach to every Document
shuffle_files bool No (default: False) Shuffle the order of data files before reading
load_from_disk bool No (default: False) Load a dataset previously saved with Dataset.save_to_disk() instead of fetching from the Hub

Outputs

Name Type Description
documents Generator[Document] Stream of Document objects, each containing:
  • text - the text content from the dataset row's text_key column
  • id - the identifier from the row's id_key column (or an auto-generated ID)
  • metadata - remaining columns from the dataset row plus any default_metadata
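
A minimal sketch of consuming this output, assuming the usual datatrove convention that pipeline steps are callable and yield their documents when invoked:

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Minimal sketch: calling the reader runs it and yields Document objects
reader = HuggingFaceDatasetReader(
    dataset="wikimedia/wikipedia",
    dataset_options={"name": "20231101.en", "split": "train"},
    streaming=True,
    limit=3,
)

for doc in reader():
    # columns other than text_key/id_key (e.g., "title", "url") land in metadata
    print(doc.id, doc.metadata.get("title"), doc.text[:80])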

Usage Examples

Loading a Public Hub Dataset

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Read the English Wikipedia dataset
reader = HuggingFaceDatasetReader(
    dataset="wikipedia",
    dataset_options={"name": "20231101.en", "split": "train"},
    streaming=True,
    text_key="text",
    id_key="id",
)

Using in a Pipeline with Streaming

from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.executor import LocalPipelineExecutor

executor = LocalPipelineExecutor(
    pipeline=[
        HuggingFaceDatasetReader(
            dataset="allenai/c4",
            dataset_options={"name": "en", "split": "train"},
            streaming=True,
            batch_size=2000,
            limit=100000,
        ),
        LambdaFilter(lambda doc: len(doc.text) > 100),
    ],
    tasks=4,
)
executor.run()
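
Note that when running with multiple tasks, each task reads a distinct shard of the dataset, and limit and skip are applied per task; the pipeline above therefore reads up to 100000 documents in each of the four shards. This per-task behavior follows the usual datatrove reader convention and is worth verifying against your installed version.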

Loading from Local Disk

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Read a dataset previously saved with Dataset.save_to_disk()
reader = HuggingFaceDatasetReader(
    dataset="/data/saved-datasets/my-corpus/",
    load_from_disk=True,
    text_key="content",
    id_key="doc_id",
    default_metadata={"source": "local"},
)
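
Using a Custom Adapter

The adapter parameter replaces the default row-to-Document conversion. The sketch below assumes datatrove's reader adapter convention, where the adapter is bound as a method and receives the raw row dict plus a source path and in-file index, returning a dict with text, id, and metadata keys; the dataset name and its columns are hypothetical.

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Hypothetical adapter: joins two columns into the Document text and keeps
# all remaining columns as metadata. Signature follows the reader adapter
# convention: (self, raw row dict, path, id_in_file) -> dict.
def title_body_adapter(self, data: dict, path: str, id_in_file) -> dict:
    return {
        "text": f"{data.pop('title', '')}\n\n{data.pop('body', '')}",
        "id": data.pop("id", f"{path}/{id_in_file}"),
        "metadata": data,  # leftover columns become Document metadata
    }

# "my-org/news-corpus" and its columns are placeholders for illustration
reader = HuggingFaceDatasetReader(
    dataset="my-org/news-corpus",
    dataset_options={"split": "train"},
    adapter=title_body_adapter,
)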
