Implementation: Hugging Face datatrove WarcReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Ingestion, Web_Crawling |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for reading WARC (Web ARChive) files, provided by the datatrove library. WarcReader extends BaseDiskReader to stream WARC records from disk or remote storage, converting each record into a Document object suitable for downstream pipeline processing.
Description
WarcReader is a pipeline reader component that processes WARC files produced by web crawlers such as Common Crawl. Each WARC record is parsed to extract the HTML text content, the source URL, the capture date, and the unique WARC-Record-ID. These fields are assembled into a Document object that flows through the rest of the datatrove pipeline.
Key capabilities include:
- Compression support: handles gzip- and zstd-compressed WARC files, with automatic detection via the `"infer"` mode
- Glob patterns: supports file glob patterns to select specific WARC files from a directory
- Shard-based parallel reading: files can be distributed across multiple workers for parallel ingestion
- Configurable field mapping: the `adapter`, `text_key`, and `id_key` parameters control how WARC record fields map to Document fields
- Record filtering: internally keeps only `response` and `conversion` record types, skipping non-content records
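The record-type filtering above can be sketched as a simple predicate. This is an illustrative assumption, not datatrove's actual code; the helper name `keep_record` is hypothetical.

```python
# Sketch (an assumption, not datatrove source): only `response` and
# `conversion` WARC records carry page content; everything else
# (requests, metadata, warcinfo, ...) is skipped.
KEEP_TYPES = {"response", "conversion"}

def keep_record(headers: dict) -> bool:
    """Return True if a WARC record should become a Document."""
    return headers.get("WARC-Type") in KEEP_TYPES
```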
Usage
Import and use WarcReader when ingesting raw Common Crawl WARC dumps or any WARC-format web archive data into a datatrove pipeline. It is typically the first stage in a pipeline that includes HTML extraction, text filtering, and deduplication.
Code Reference
Source Location
- Repository: datatrove
- File: `src/datatrove/pipeline/readers/warc.py`
- Lines: L11-84 (WarcReader class definition), L87-140 (process_record method)
Signature
```python
class WarcReader(BaseDiskReader):
    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    ):
```
Import
```python
from datatrove.pipeline.readers import WarcReader
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | DataFolderLike | Yes | Path or data folder object pointing to WARC files |
| paths_file | DataFileLike | No | File listing specific WARC file paths to read |
| compression | Literal["infer", "gzip", "zstd"] or None | No (default: "infer") | Compression scheme for WARC files; "infer" auto-detects from file extension |
| limit | int | No (default: -1) | Maximum number of documents to read; -1 for unlimited |
| skip | int | No (default: 0) | Number of documents to skip from the beginning |
| file_progress | bool | No (default: False) | Show progress bar for files processed |
| doc_progress | bool | No (default: False) | Show progress bar for documents processed |
| adapter | Callable | No (default: None) | Custom function to transform raw record data before Document creation |
| text_key | str | No (default: "text") | Key name for the text content field in the Document |
| id_key | str | No (default: "id") | Key name for the document identifier field |
| default_metadata | dict | No (default: None) | Default metadata to attach to every Document |
| recursive | bool | No (default: True) | Recursively search subdirectories for WARC files |
| glob_pattern | str | No (default: None) | Glob pattern to filter which files to read |
| shuffle_files | bool | No (default: False) | Shuffle the order of files before reading |
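The `skip` and `limit` parameters in the table above describe a "skip N documents, then read at most M" contract. A minimal sketch of that behavior, using `itertools.islice` (the helper `apply_skip_limit` is an illustration, not datatrove API):

```python
import itertools

# Skip the first `skip` documents, then yield at most `limit` more;
# limit=-1 means unbounded, matching the defaults in the table above.
def apply_skip_limit(docs, skip: int = 0, limit: int = -1):
    stop = None if limit == -1 else skip + limit
    return itertools.islice(docs, skip, stop)

list(apply_skip_limit(range(10), skip=2, limit=3))  # → [2, 3, 4]
```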
Outputs
| Name | Type | Description |
|---|---|---|
| documents | Generator[Document] | Stream of Document objects, each containing the record's raw HTML text, the WARC-Record-ID as the document id, and metadata including the source URL and capture date |
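A minimal sketch of the field assembly this output contract describes: building a Document-shaped dict from the parsed WARC fields. `to_document` is a hypothetical helper for illustration, not part of the datatrove API.

```python
# Assemble a Document-like dict from parsed WARC record fields,
# honoring the text_key / id_key / default_metadata parameters
# described in the Inputs table (sketch, not datatrove source).
def to_document(html, url, date, record_id,
                text_key="text", id_key="id", default_metadata=None):
    return {
        text_key: html,        # raw HTML content of the record
        id_key: record_id,     # WARC-Record-ID
        "metadata": {**(default_metadata or {}), "url": url, "date": date},
    }
```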
Usage Examples
Reading Common Crawl WARC Dumps
```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.readers import WarcReader

# Read all WARC files from a local directory
reader = WarcReader(
    data_folder="/data/common-crawl/CC-MAIN-2024-10/segments/",
    glob_pattern="*.warc.gz",
    compression="gzip",
)

# Use in a pipeline: read a sample from S3, then extract text with Trafilatura
executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            data_folder="s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
            limit=1000,
        ),
        Trafilatura(),
    ],
    tasks=8,
)
executor.run()
```
Reading with Custom Adapter
```python
from datatrove.pipeline.readers import WarcReader

def custom_adapter(self, data, path, id_in_file):
    """Custom adapter to add file path metadata."""
    return {
        "text": data["text"],
        "id": data["id"],
        "metadata": {
            **data.get("metadata", {}),
            "source_file": path,
        },
    }

reader = WarcReader(
    data_folder="/data/warc-archives/",
    adapter=custom_adapter,
    recursive=True,
)
```