Implementation: HuggingFace Datatrove HuggingFaceDatasetReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, NLP_Data_Processing, ML_Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for reading HuggingFace datasets provided by the datatrove library. HuggingFaceDatasetReader extends BaseReader to load datasets from the HuggingFace Hub (or from disk) and convert each row into a Document object for pipeline processing.
Description
HuggingFaceDatasetReader is a pipeline reader component that integrates the HuggingFace datasets library into the datatrove pipeline framework. It loads a named dataset from the Hub (or a local dataset saved with Dataset.save_to_disk()) and iterates over its rows, converting each row into a Document object with text content, an identifier, and metadata.
Key capabilities include:
- Hub integration: Loads any dataset from the HuggingFace Hub by name (e.g., "wikipedia", "allenai/c4")
- Streaming support: When streaming=True, data is fetched on demand without downloading the entire dataset
- Batched iteration: Reads rows in configurable batches (default 1000) to optimize I/O throughput
- Dataset options passthrough: The dataset_options dictionary allows specifying split, configuration, revision, and other load_dataset parameters
- Local disk loading: When load_from_disk=True, reads datasets previously saved with Dataset.save_to_disk() instead of downloading from the Hub
- Configurable field mapping: The text_key and id_key parameters control which dataset columns map to Document fields
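The batched-iteration idea from the list above can be illustrated with a small generic generator. This is a conceptual model only, assuming nothing about datatrove's internals; the helper name `iter_batches` is made up for illustration.

```python
def iter_batches(rows, batch_size=1000):
    """Group an iterable of rows into lists of at most batch_size items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final partial batch, if the row count is not a multiple of batch_size
        yield batch

batches = list(iter_batches(range(5), batch_size=2))
```

Reading rows in chunks like this amortizes per-request overhead, which matters most in streaming mode where each fetch goes over the network.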
Unlike the disk-based readers (WarcReader, JsonlReader), HuggingFaceDatasetReader extends BaseReader rather than BaseDiskReader, since it delegates file handling entirely to the datasets library.
Usage
Import and use HuggingFaceDatasetReader when using HuggingFace Hub datasets as pipeline input for tokenization, filtering, deduplication, or inference tasks. It is the preferred reader when working with datasets already hosted on the Hub.
Code Reference
Source Location
- Repository: datatrove
- File: src/datatrove/pipeline/readers/huggingface.py
- Lines: 11-144 (HuggingFaceDatasetReader class and methods)
Signature
class HuggingFaceDatasetReader(BaseReader):
def __init__(
self,
dataset: str,
dataset_options: dict | None = None,
streaming: bool = False,
limit: int = -1,
skip: int = 0,
batch_size: int = 1000,
doc_progress: bool = False,
adapter: Callable = None,
text_key: str = "text",
id_key: str = "id",
default_metadata: dict = None,
shuffle_files: bool = False,
load_from_disk: bool = False,
):
Import
from datatrove.pipeline.readers import HuggingFaceDatasetReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | str | Yes | HuggingFace dataset name (e.g., "wikipedia", "allenai/c4") or path to a local dataset directory |
| dataset_options | dict | No (default: None) | Additional keyword arguments passed to load_dataset(), such as split, name (configuration), revision, data_dir |
| streaming | bool | No (default: False) | Stream data from the Hub without downloading the full dataset |
| limit | int | No (default: -1) | Maximum number of documents to read; -1 for unlimited |
| skip | int | No (default: 0) | Number of documents to skip from the beginning |
| batch_size | int | No (default: 1000) | Number of rows to read per batch for I/O optimization |
| doc_progress | bool | No (default: False) | Show progress bar for documents processed |
| adapter | Callable | No (default: None) | Custom function to transform raw row data before Document creation |
| text_key | str | No (default: "text") | Dataset column name containing the document text content |
| id_key | str | No (default: "id") | Dataset column name containing the document identifier |
| default_metadata | dict | No (default: None) | Default metadata to attach to every Document |
| shuffle_files | bool | No (default: False) | Shuffle the order of data files before reading |
| load_from_disk | bool | No (default: False) | Load a dataset previously saved with Dataset.save_to_disk() instead of fetching from the Hub |
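The adapter parameter lets callers reshape a raw row before Document creation. Datatrove reader adapters conventionally receive the reader instance, the raw row dict, a source path, and an in-file index, and return a dict with text/id/metadata keys; treat the exact signature below as an assumption and verify it against the installed datatrove version.

```python
def my_adapter(self, data: dict, path: str, id_in_file) -> dict:
    """Assumed adapter shape: map a raw row to text/id/metadata fields."""
    return {
        # Pull the text from a non-default column name
        "text": data.get("content", ""),
        # Build a stable identifier from the source path and row index
        "id": f"{path}/{id_in_file}",
        # Record which columns the raw row carried
        "metadata": {"original_columns": sorted(data.keys())},
    }

result = my_adapter(None, {"content": "hi", "lang": "en"}, "mydataset", 0)
```

Passing `adapter=my_adapter` to the reader would then replace the default column mapping with this custom one.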
Outputs
| Name | Type | Description |
|---|---|---|
| documents | Generator[Document] | Stream of Document objects, each containing text (from the text_key column), id (from the id_key column), and metadata (remaining row columns merged with default_metadata) |
Usage Examples
Loading a Public Hub Dataset
from datatrove.pipeline.readers import HuggingFaceDatasetReader
# Read the English Wikipedia dataset
reader = HuggingFaceDatasetReader(
dataset="wikipedia",
dataset_options={"name": "20231101.en", "split": "train"},
streaming=True,
text_key="text",
id_key="id",
)
Using in a Pipeline with Streaming
from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.executor import LocalPipelineExecutor
executor = LocalPipelineExecutor(
pipeline=[
HuggingFaceDatasetReader(
dataset="allenai/c4",
dataset_options={"name": "en", "split": "train"},
streaming=True,
batch_size=2000,
limit=100000,
),
LambdaFilter(lambda doc: len(doc.text) > 100),
],
tasks=4,
)
executor.run()
Loading from Local Disk
from datatrove.pipeline.readers import HuggingFaceDatasetReader
# Read a dataset previously saved with Dataset.save_to_disk()
reader = HuggingFaceDatasetReader(
dataset="/data/saved-datasets/my-corpus/",
load_from_disk=True,
text_key="content",
id_key="doc_id",
default_metadata={"source": "local"},
)
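The skip and limit parameters used in the examples above behave, conceptually, like slicing the document stream. A plain-Python analogy (not the library's code; the helper `apply_skip_limit` is illustrative only):

```python
from itertools import islice

def apply_skip_limit(rows, skip=0, limit=-1):
    """Yield rows after dropping `skip` items, then stop after `limit` items.

    limit=-1 means unlimited, mirroring the reader's convention.
    """
    stop = None if limit == -1 else skip + limit
    yield from islice(rows, skip, stop)

docs = list(apply_skip_limit(range(10), skip=2, limit=3))
```

With skip=2 and limit=3 over ten rows, the stream yields rows 2, 3, and 4, which matches how limit=100000 in the pipeline example caps the number of documents each reader emits.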