Implementation:Huggingface Datatrove WarcMediaReader
| Knowledge Sources | |
|---|---|
| Domains | Media Processing, Web Archive Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
WarcReaderFast is a threaded media reader that extracts binary media content from WARC (Web ARChive) files by seeking to specific offsets and parsing individual WARC records.
Description
The WarcReaderFast class extends BinaryReaderThreaded to implement WARC-specific media reading. Each media object is expected to carry a path (the WARC file location) and an offset (the byte position of the target WARC record within that file). The reader seeks directly to the specified offset rather than scanning the entire archive, enabling efficient random access to individual records.
The implementation uses thread-local storage to manage file pointers across concurrent reads. Each worker thread keeps its own open file pointer together with the name of the file it belongs to. When the next read targets the same WARC file, the existing pointer is reused, which avoids repeatedly opening and closing files. When the target file changes, the old pointer is closed before the new file is opened.
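The per-thread pointer reuse described above can be sketched as follows. This is a minimal illustration using only the standard library, not the datatrove source: the helper name `get_file_pointer`, the module-level `_local` object, and the `opener` parameter are assumptions introduced for the example.

```python
import threading

# Hypothetical module-level storage; each thread sees its own attributes on it.
_local = threading.local()


def get_file_pointer(path, opener=open):
    """Return a binary file object for `path`, reusing this thread's handle if possible.

    `opener` is a hypothetical hook so that a remote filesystem opener could be
    swapped in; it defaults to the built-in open().
    """
    current_fp = getattr(_local, "fp", None)
    current_path = getattr(_local, "path", None)

    # Same file as the previous read on this thread: reuse the open handle.
    if current_fp is not None and current_path == path:
        return current_fp

    # The target file changed: close the old handle before opening the new one.
    if current_fp is not None:
        current_fp.close()

    _local.fp = opener(path, "rb")
    _local.path = path
    return _local.fp
```

Because the state lives on a `threading.local` object, no lock is needed: a worker never sees or closes another worker's handle. The trade-off is that each thread may hold one open file at a time, so up to `workers` files can be open at once.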
After seeking to the correct offset, the reader creates a new ArchiveIterator instance from the warcio library with a configurable block size derived from the media's length attribute (defaulting to 128KB). It reads exactly one WARC record and returns its content stream bytes. The block size optimization is important because knowing the record length in advance allows the decompressor to read efficiently without over-fetching.
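The seek-then-read step might look roughly like the sketch below. It assumes the warcio package is installed and uses its `ArchiveIterator` with the `block_size` argument; `MediaRef`, `pick_block_size`, and `read_one_record` are hypothetical names for illustration and are not datatrove's `Media` class or the actual `read_media_record` body.

```python
from dataclasses import dataclass
from typing import BinaryIO, Optional

DEFAULT_BLOCK_SIZE = 128 * 1024  # 128 KB fallback when no length hint is given


@dataclass
class MediaRef:
    """Hypothetical stand-in for the media metadata (not datatrove's Media)."""
    path: Optional[str] = None
    offset: Optional[int] = None
    length: Optional[int] = None


def pick_block_size(media: MediaRef) -> int:
    # Use the known record length as the read block size, else the default.
    return media.length if media.length else DEFAULT_BLOCK_SIZE


def read_one_record(fp: BinaryIO, media: MediaRef) -> Optional[bytes]:
    """Seek to media.offset in an already-open WARC file and read one record."""
    if media.path is None or media.offset is None:
        return None  # nothing to locate without both a file and an offset

    # Imported lazily so the helpers above can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    fp.seek(media.offset)
    iterator = ArchiveIterator(fp, block_size=pick_block_size(media))
    for record in iterator:
        # Read the payload before the iterator advances, then stop after one record.
        return record.content_stream().read()
    return None  # no record found at this offset
```

For gzip-compressed archives, this kind of random access only works when `offset` points at the start of a record's own gzip member, which is how per-record compressed WARC files are normally written.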
Usage
Use WarcReaderFast when your pipeline needs to read media content stored in WARC archives, such as images or other binary resources crawled from the web. It is designed for use cases where media metadata (including file paths and byte offsets) has been pre-computed and stored alongside document records.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/media/media_readers/warc.py
- Lines: 1-62
Signature
class WarcReaderFast(BinaryReaderThreaded):
    type = "Media Reader"
    name = "🌐 - Warc Reader Fast"

    def read_media_record(self, media: Media):
        ...
Import
from datatrove.pipeline.media.media_readers.warc import WarcReaderFast
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | DataFolderLike | Yes | Path to the folder containing WARC files (inherited from BinaryReaderThreaded) |
| workers | int | No | Number of worker threads (inherited, default: 1) |
| preserve_order | bool | No | Whether to preserve document order (inherited, default: False) |
| media.path | str | Yes | Path to the WARC file containing this media record |
| media.offset | int | Yes | Byte offset of the WARC record within the file |
| media.length | int | No | Length hint for buffer sizing (default: 128KB) |
Outputs
| Name | Type | Description |
|---|---|---|
| content | bytes or None | The raw content bytes extracted from the WARC record, or None if the media's path or offset is missing |
Usage Examples
Basic Usage
from datatrove.pipeline.media.media_readers.warc import WarcReaderFast

# Read media from WARC files with 8 worker threads
warc_reader = WarcReaderFast(
    data_folder="s3://my-bucket/warc-files/",
    workers=8,
    preserve_order=False,
)

# Use in a pipeline after a document reader that provides media offsets
pipeline = [
    # ... document reader that populates media.path and media.offset ...
    warc_reader,
    # ... downstream processing ...
]