
Implementation:Huggingface Datatrove WarcReader

From Leeroopedia
Knowledge Sources
Domains Data_Ingestion, Web_Crawling
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool for reading WARC (Web ARChive) files, provided by the datatrove library. WarcReader extends BaseDiskReader to stream WARC records from local disk or remote storage, converting each record into a Document object suitable for downstream pipeline processing.

Description

WarcReader is a pipeline reader component that processes WARC files produced by web crawlers such as Common Crawl. Each WARC record is parsed to extract the HTML text content, the source URL, the capture date, and the unique WARC-Record-ID. These fields are assembled into a Document object that flows through the rest of the datatrove pipeline.

Key capabilities include:

  • Compression support: Handles gzip and zstd compressed WARC files, with automatic detection via the "infer" mode
  • Glob patterns: Supports file glob patterns to select specific WARC files from a directory
  • Shard-based parallel reading: Files can be distributed across multiple workers for parallel ingestion
  • Configurable field mapping: The adapter, text_key, and id_key parameters control how WARC record fields map to Document fields
  • Record filtering: Internally filters for response and conversion record types, skipping non-content records

Usage

Import and use WarcReader when ingesting raw Common Crawl WARC dumps or any WARC-format web archive data into a datatrove pipeline. It is typically the first stage in a pipeline that includes HTML extraction, text filtering, and deduplication.

Code Reference

Source Location

  • Repository: datatrove
  • File: src/datatrove/pipeline/readers/warc.py
  • Lines: L11-84 (WarcReader class definition), L87-140 (process_record method)

Signature

class WarcReader(BaseDiskReader):
    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    ):

Import

from datatrove.pipeline.readers import WarcReader

I/O Contract

Inputs

Name Type Required Description
data_folder DataFolderLike Yes Path or data folder object pointing to WARC files
paths_file DataFileLike No File listing specific WARC file paths to read
compression Literal["infer", "gzip", "zstd"] or None No (default: "infer") Compression scheme for WARC files; "infer" auto-detects from file extension
limit int No (default: -1) Maximum number of documents to read; -1 for unlimited
skip int No (default: 0) Number of documents to skip from the beginning
file_progress bool No (default: False) Show progress bar for files processed
doc_progress bool No (default: False) Show progress bar for documents processed
adapter Callable No (default: None) Custom function to transform raw record data before Document creation
text_key str No (default: "text") Key name for the text content field in the Document
id_key str No (default: "id") Key name for the document identifier field
default_metadata dict No (default: None) Default metadata to attach to every Document
recursive bool No (default: True) Recursively search subdirectories for WARC files
glob_pattern str No (default: None) Glob pattern to filter which files to read
shuffle_files bool No (default: False) Shuffle the order of files before reading

Outputs

Name Type Description
documents Generator[Document] Stream of Document objects, each containing:
  • text - the raw HTML content from the WARC response body
  • id - the WARC-Record-ID (a URN UUID)
  • metadata - dictionary with url (WARC-Target-URI) and date (WARC-Date)
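As a rough illustration of this contract (not datatrove's actual code), the mapping from parsed WARC response headers to the Document fields listed above could be sketched as follows. The record dict and Document shape here are simplified stand-ins.

```python
# Illustrative sketch of the WARC-record-to-Document field mapping
# described above. The input dict and output shape are simplified
# stand-ins, not datatrove's real data structures.

def record_to_document(record: dict) -> dict:
    """Map parsed WARC response headers and body to a Document-like dict."""
    return {
        "text": record["body"],              # raw HTML from the response body
        "id": record["WARC-Record-ID"],      # URN UUID identifying the record
        "metadata": {
            "url": record["WARC-Target-URI"],
            "date": record["WARC-Date"],
        },
    }


doc = record_to_document({
    "body": "<html>...</html>",
    "WARC-Record-ID": "<urn:uuid:1234>",
    "WARC-Target-URI": "https://example.com/",
    "WARC-Date": "2024-03-01T00:00:00Z",
})
```

The `text_key` and `id_key` parameters rename the `"text"` and `"id"` keys, and a custom `adapter` can replace this mapping entirely.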

Usage Examples

Reading Common Crawl WARC Dumps

from datatrove.pipeline.readers import WarcReader

# Read all WARC files from a local directory
reader = WarcReader(
    data_folder="/data/common-crawl/CC-MAIN-2024-10/segments/",
    glob_pattern="*.warc.gz",
    compression="gzip",
)

# Use in a pipeline
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            data_folder="s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
            limit=1000,
        ),
        Trafilatura(),
    ],
    tasks=8,
)
executor.run()

Reading with Custom Adapter

from datatrove.pipeline.readers import WarcReader

def custom_adapter(self, data, path, id_in_file):
    """Custom adapter to add file path metadata."""
    return {
        "text": data["text"],
        "id": data["id"],
        "metadata": {
            **data.get("metadata", {}),
            "source_file": path,
        },
    }

reader = WarcReader(
    data_folder="/data/warc-archives/",
    adapter=custom_adapter,
    recursive=True,
)

Related Pages

Implements Principle
