
Implementation:Huggingface Datatrove WarcReader

From Leeroopedia
Knowledge Sources
Domains Data_Ingestion, Web_Crawling
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool for reading WARC (Web ARChive) files, provided by the datatrove library. WarcReader extends BaseDiskReader to stream WARC records from local disk or remote storage, converting each record into a Document object suitable for downstream pipeline processing.

Description

WarcReader is a pipeline reader component that processes WARC files produced by web crawlers such as Common Crawl. Each WARC record is parsed to extract the HTML text content, the source URL, the capture date, and the unique WARC-Record-ID. These fields are assembled into a Document object that flows through the rest of the datatrove pipeline.

Key capabilities include:

  • Compression support: Handles gzip and zstd compressed WARC files, with automatic detection via the "infer" mode
  • Glob patterns: Supports file glob patterns to select specific WARC files from a directory
  • Shard-based parallel reading: Files can be distributed across multiple workers for parallel ingestion
  • Configurable field mapping: The adapter, text_key, and id_key parameters control how WARC record fields map to Document fields
  • Record filtering: Internally filters for response and conversion record types, skipping non-content records

Usage

Import and use WarcReader when ingesting raw Common Crawl WARC dumps or any WARC-format web archive data into a datatrove pipeline. It is typically the first stage in a pipeline that includes HTML extraction, text filtering, and deduplication.

Code Reference

Source Location

  • Repository: datatrove
  • File: src/datatrove/pipeline/readers/warc.py
  • Lines: L11-84 (WarcReader class definition), L87-140 (process_record method)

Signature

class WarcReader(BaseDiskReader):
    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    ):

Import

from datatrove.pipeline.readers import WarcReader

I/O Contract

Inputs

Name Type Required Description
data_folder DataFolderLike Yes Path or data folder object pointing to WARC files
paths_file DataFileLike No File listing specific WARC file paths to read
compression Literal["infer", "gzip", "zstd"] or None No (default: "infer") Compression scheme for WARC files; "infer" auto-detects from file extension
limit int No (default: -1) Maximum number of documents to read; -1 for unlimited
skip int No (default: 0) Number of documents to skip from the beginning
file_progress bool No (default: False) Show progress bar for files processed
doc_progress bool No (default: False) Show progress bar for documents processed
adapter Callable No (default: None) Custom function to transform raw record data before Document creation
text_key str No (default: "text") Key name for the text content field in the Document
id_key str No (default: "id") Key name for the document identifier field
default_metadata dict No (default: None) Default metadata to attach to every Document
recursive bool No (default: True) Recursively search subdirectories for WARC files
glob_pattern str No (default: None) Glob pattern to filter which files to read
shuffle_files bool No (default: False) Shuffle the order of files before reading

Outputs

Name Type Description
documents Generator[Document] Stream of Document objects, each containing:
  • text - the raw HTML content from the WARC response body
  • id - the WARC-Record-ID (a URN UUID)
  • metadata - dictionary with url (WARC-Target-URI) and date (WARC-Date)
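As a rough illustration of this contract (not datatrove's actual code), the mapping from parsed WARC response headers to the Document fields listed above could be sketched as follows. The record dict and Document shape here are simplified stand-ins.

```python
# Illustrative sketch of the WARC-record-to-Document field mapping
# described above. The input dict and output shape are simplified
# stand-ins, not datatrove's real data structures.

def record_to_document(record: dict) -> dict:
    """Map parsed WARC response headers and body to a Document-like dict."""
    return {
        "text": record["body"],              # raw HTML from the response body
        "id": record["WARC-Record-ID"],      # URN UUID identifying the record
        "metadata": {
            "url": record["WARC-Target-URI"],
            "date": record["WARC-Date"],
        },
    }


doc = record_to_document({
    "body": "<html>...</html>",
    "WARC-Record-ID": "<urn:uuid:1234>",
    "WARC-Target-URI": "https://example.com/",
    "WARC-Date": "2024-03-01T00:00:00Z",
})
```

The `text_key` and `id_key` parameters rename the `"text"` and `"id"` keys, and a custom `adapter` can replace this mapping entirely.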

Usage Examples

Reading Common Crawl WARC Dumps

from datatrove.pipeline.readers import WarcReader

# Read all WARC files from a local directory
reader = WarcReader(
    data_folder="/data/common-crawl/CC-MAIN-2024-10/segments/",
    glob_pattern="*.warc.gz",
    compression="gzip",
)

# Use in a pipeline
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            data_folder="s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
            limit=1000,
        ),
        Trafilatura(),
    ],
    tasks=8,
)
executor.run()

Reading with Custom Adapter

from datatrove.pipeline.readers import WarcReader

def custom_adapter(self, data, path, id_in_file):
    """Custom adapter to add file path metadata."""
    return {
        "text": data["text"],
        "id": data["id"],
        "metadata": {
            **data.get("metadata", {}),
            "source_file": path,
        },
    }

reader = WarcReader(
    data_folder="/data/warc-archives/",
    adapter=custom_adapter,
    recursive=True,
)

Related Pages

Implements Principle
