Implementation:Huggingface Datatrove WarcMediaReader

Knowledge Sources	Huggingface_Datatrove
Domains	Media Processing, Web Archive Processing
Last Updated	2026-02-14 17:00 GMT

Overview

WarcReaderFast is a threaded media reader that extracts binary media content from WARC (Web ARChive) files by seeking to specific offsets and parsing individual WARC records.

Description

The WarcReaderFast class extends BinaryReaderThreaded to implement WARC-specific media reading. Each media object is expected to carry a path (the WARC file location) and an offset (the byte position of the target WARC record within that file). The reader seeks directly to the specified offset rather than scanning the entire archive, enabling efficient random access to individual records.

The implementation relies on thread-local storage to manage file pointers across concurrent reads. Each worker thread keeps its own open file pointer together with the name of the file it belongs to. When the next read targets the same WARC file, the existing pointer is reused, which avoids the overhead of repeatedly opening and closing files. When the target file changes, the old pointer is closed before the new file is opened.
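The per-thread handle reuse described above can be sketched with the standard library alone. This is a minimal illustration, not the Datatrove source: the names `_local`, `get_handle`, and `read_at` are hypothetical, and plain `open` stands in for whatever the data folder uses to open files.

```python
import os
import tempfile
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def get_handle(path):
    """Return this thread's open handle for `path`, reopening only when the path changes."""
    if getattr(_local, "path", None) != path:
        old = getattr(_local, "fp", None)
        if old is not None:
            old.close()  # release the previous file before switching
        # Reset first so a failed open() does not leave a closed handle cached.
        _local.fp = None
        _local.path = None
        _local.fp = open(path, "rb")
        _local.path = path
    return _local.fp


def read_at(path, offset, length):
    """Seek to `offset` in `path` and read `length` bytes through the cached handle."""
    fp = get_handle(path)
    fp.seek(offset)
    return fp.read(length)


# Small demonstration with two temporary files (hypothetical data).
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.bin")
path_b = os.path.join(tmpdir, "b.bin")
with open(path_a, "wb") as f:
    f.write(b"0123456789")
with open(path_b, "wb") as f:
    f.write(b"abcdefghij")

first = get_handle(path_a)
same = get_handle(path_a)      # same path: the handle is reused
switched = get_handle(path_b)  # new path: old handle is closed, a new one is opened
```

Because the cache lives on a `threading.local` object, two worker threads reading the same file each hold their own handle, so their seek positions never interfere.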

After seeking to the target offset, the reader creates a new ArchiveIterator from the warcio library, with a block size derived from the media's length attribute (128 KB when no length is given). It reads exactly one WARC record and returns the bytes of that record's content stream. Sizing the block to the record matters because, when the record length is known in advance, the decompressor can fetch the record without reading far past its end.
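The seek-then-read-one-record idea can be illustrated without warcio. The sketch below swaps in the standard-library `zlib` module for `ArchiveIterator` and assumes each record is stored as its own gzip member, as is conventional for `.warc.gz` files. It returns the decompressed member rather than a parsed WARC record, and the names `read_member_at` and `DEFAULT_BLOCK_SIZE` are hypothetical.

```python
import gzip
import io
import zlib

DEFAULT_BLOCK_SIZE = 128 * 1024  # mirrors the 128 KB fallback described above


def read_member_at(fp, offset, length=None):
    """Seek to `offset` and decompress exactly one gzip member.

    `length` is a hint for the read block size; the fallback is used when it is missing.
    """
    block_size = length or DEFAULT_BLOCK_SIZE
    fp.seek(offset)
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    chunks = []
    while not decomp.eof:
        block = fp.read(block_size)
        if not block:
            break  # input ended before the member was complete
        chunks.append(decomp.decompress(block))
    return b"".join(chunks)


# Build a small multi-member archive in memory, recording each member's offset and length.
payloads = [b"first record payload", b"second record payload", b"third"]
archive = io.BytesIO()
offsets, lengths = [], []
for payload in payloads:
    member = gzip.compress(payload)
    offsets.append(archive.tell())
    lengths.append(len(member))
    archive.write(member)

second = read_member_at(archive, offsets[1], lengths[1])
```

When the hint equals the compressed size of the member, a single `read` call fetches the whole record and nothing beyond it. A smaller or missing hint still works, at the cost of extra reads or of fetching bytes that belong to the following records.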

Usage

Use WarcReaderFast when your pipeline needs to read media content stored in WARC archives, such as images or other binary resources crawled from the web. It is designed for use cases where media metadata (including file paths and byte offsets) has been pre-computed and stored alongside document records.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/media/media_readers/warc.py
Lines: 1-62

Signature

class WarcReaderFast(BinaryReaderThreaded):
    type = "Media Reader"
    name = "🌐 - Warc Reader Fast"

    def read_media_record(self, media: Media):
        ...

Import

from datatrove.pipeline.media.media_readers.warc import WarcReaderFast

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	DataFolderLike	Yes	Path to the folder containing WARC files (inherited from BinaryReaderThreaded)
workers	int	No	Number of worker threads (inherited, default: 1)
preserve_order	bool	No	Whether to preserve document order (inherited, default: False)
media.path	str	Yes	Path to the WARC file containing this media record
media.offset	int	Yes	Byte offset of the WARC record within the file
media.length	int	No	Length hint for buffer sizing (default: 128KB)

Outputs

Name	Type	Description
content	bytes or None	The raw content bytes extracted from the WARC record, or None if offset/path is missing

Usage Examples

Basic Usage

from datatrove.pipeline.media.media_readers.warc import WarcReaderFast

# Read media from WARC files with 8 worker threads
warc_reader = WarcReaderFast(
    data_folder="s3://my-bucket/warc-files/",
    workers=8,
    preserve_order=False,
)

# Use in a pipeline after a document reader that provides media offsets
pipeline = [
    # ... document reader that populates media.path and media.offset ...
    warc_reader,
    # ... downstream processing ...
]

Related Pages

Principle:Huggingface_Datatrove_Media_Reading_Framework
