
Implementation:Huggingface Datatrove HuggingFaceDatasetReader

From Leeroopedia
Domains Data_Ingestion, NLP_Data_Processing, ML_Infrastructure
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool from the datatrove library for reading HuggingFace datasets. HuggingFaceDatasetReader extends BaseReader to load datasets from the HuggingFace Hub (or from disk) and convert each row into a Document object for pipeline processing.

Description

HuggingFaceDatasetReader is a pipeline reader component that integrates the HuggingFace datasets library into the datatrove pipeline framework. It loads a named dataset from the Hub (or a local dataset saved with Dataset.save_to_disk()) and iterates over its rows, converting each row into a Document object with text content, an identifier, and metadata.

Key capabilities include:

  • Hub integration: Loads any dataset from the HuggingFace Hub by name (e.g., "wikipedia", "allenai/c4")
  • Streaming support: When streaming=True, data is fetched on-demand without downloading the entire dataset
  • Batched iteration: Reads rows in configurable batches (default 1000) to optimize I/O throughput
  • Dataset options passthrough: The dataset_options dictionary allows specifying split, configuration, revision, and other load_dataset parameters (see the sketch after this list)
  • Local disk loading: When load_from_disk=True, reads datasets previously saved with Dataset.save_to_disk() instead of downloading from the Hub
  • Configurable field mapping: The text_key and id_key parameters control which dataset columns map to Document fields
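
As a brief illustration of the dataset_options passthrough, the sketch below forwards a configuration name, split, and pinned revision to load_dataset(); the dataset and revision values are illustrative, not recommendations.

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Every key in dataset_options is forwarded to datasets.load_dataset();
# "revision" pins a Hub branch or commit (the value here is illustrative).
reader = HuggingFaceDatasetReader(
    dataset="allenai/c4",
    dataset_options={"name": "en", "split": "validation", "revision": "main"},
    streaming=True,
)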

Unlike the disk-based readers (WarcReader, JsonlReader), HuggingFaceDatasetReader extends BaseReader rather than BaseDiskReader, since it delegates file handling entirely to the datasets library.

Usage

Use HuggingFaceDatasetReader when a HuggingFace Hub dataset serves as the pipeline input for tokenization, filtering, deduplication, or inference tasks. It is the preferred reader for datasets already hosted on the Hub.

Code Reference

Source Location

  • Repository: datatrove
  • File: src/datatrove/pipeline/readers/huggingface.py
  • Lines: L11-144 (HuggingFaceDatasetReader class and methods)

Signature

class HuggingFaceDatasetReader(BaseReader):
    def __init__(
        self,
        dataset: str,
        dataset_options: dict | None = None,
        streaming: bool = False,
        limit: int = -1,
        skip: int = 0,
        batch_size: int = 1000,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        shuffle_files: bool = False,
        load_from_disk: bool = False,
    ):

Import

from datatrove.pipeline.readers import HuggingFaceDatasetReader

I/O Contract

Inputs

Name Type Required Description
dataset str Yes HuggingFace dataset name (e.g., "wikipedia", "allenai/c4") or path to a local dataset directory
dataset_options dict No (default: None) Additional keyword arguments passed to load_dataset(), such as split, name (configuration), revision, data_dir
streaming bool No (default: False) Stream data from the Hub without downloading the full dataset
limit int No (default: -1) Maximum number of documents to read; -1 for unlimited
skip int No (default: 0) Number of documents to skip from the beginning
batch_size int No (default: 1000) Number of rows to read per batch for I/O optimization
doc_progress bool No (default: False) Show progress bar for documents processed
adapter Callable No (default: None) Custom function to transform raw row data before Document creation
text_key str No (default: "text") Dataset column name containing the document text content
id_key str No (default: "id") Dataset column name containing the document identifier
default_metadata dict No (default: None) Default metadata to attach to every Document
shuffle_files bool No (default: False) Shuffle the order of data files before reading
load_from_disk bool No (default: False) Load a dataset previously saved with Dataset.save_to_disk() instead of fetching from the Hub

Outputs

Name Type Description
documents Generator[Document] Stream of Document objects, each containing:
  • text - the text content from the dataset row's text_key column
  • id - the identifier from the row's id_key column (or an auto-generated ID)
  • metadata - remaining columns from the dataset row plus any default_metadata
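
A minimal sketch of consuming this output, assuming the usual datatrove convention that pipeline steps are callable and yield their documents when invoked:

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Minimal sketch: calling the reader runs it and yields Document objects
reader = HuggingFaceDatasetReader(
    dataset="wikimedia/wikipedia",
    dataset_options={"name": "20231101.en", "split": "train"},
    streaming=True,
    limit=3,
)

for doc in reader():
    # columns other than text_key/id_key (e.g., "title", "url") land in metadata
    print(doc.id, doc.metadata.get("title"), doc.text[:80])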

Usage Examples

Loading a Public Hub Dataset

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Read the English Wikipedia dataset
reader = HuggingFaceDatasetReader(
    dataset="wikipedia",
    dataset_options={"name": "20231101.en", "split": "train"},
    streaming=True,
    text_key="text",
    id_key="id",
)

Using in a Pipeline with Streaming

from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.executor import LocalPipelineExecutor

executor = LocalPipelineExecutor(
    pipeline=[
        HuggingFaceDatasetReader(
            dataset="allenai/c4",
            dataset_options={"name": "en", "split": "train"},
            streaming=True,
            batch_size=2000,
            limit=100000,
        ),
        LambdaFilter(lambda doc: len(doc.text) > 100),
    ],
    tasks=4,
)
executor.run()
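
Note that when running with multiple tasks, each task reads a distinct shard of the dataset, and limit and skip are applied per task; the pipeline above therefore reads up to 100000 documents in each of the four shards. This per-task behavior follows the usual datatrove reader convention and is worth verifying against your installed version.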

Loading from Local Disk

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Read a dataset previously saved with Dataset.save_to_disk()
reader = HuggingFaceDatasetReader(
    dataset="/data/saved-datasets/my-corpus/",
    load_from_disk=True,
    text_key="content",
    id_key="doc_id",
    default_metadata={"source": "local"},
)
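
Using a Custom Adapter

The adapter parameter replaces the default row-to-Document conversion. The sketch below assumes datatrove's reader adapter convention, where the adapter is bound as a method and receives the raw row dict plus a source path and in-file index, returning a dict with text, id, and metadata keys; the dataset name and its columns are hypothetical.

from datatrove.pipeline.readers import HuggingFaceDatasetReader

# Hypothetical adapter: joins two columns into the Document text and keeps
# all remaining columns as metadata. Signature follows the reader adapter
# convention: (self, raw row dict, path, id_in_file) -> dict.
def title_body_adapter(self, data: dict, path: str, id_in_file) -> dict:
    return {
        "text": f"{data.pop('title', '')}\n\n{data.pop('body', '')}",
        "id": data.pop("id", f"{path}/{id_in_file}"),
        "metadata": data,  # leftover columns become Document metadata
    }

# "my-org/news-corpus" and its columns are placeholders for illustration
reader = HuggingFaceDatasetReader(
    dataset="my-org/news-corpus",
    dataset_options={"split": "train"},
    adapter=title_body_adapter,
)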
