Implementation:Huggingface Datatrove WarcMediaReader

From Leeroopedia
Knowledge Sources
Domains: Media Processing, Web Archive Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

WarcReaderFast is a threaded media reader that extracts binary media content from WARC (Web ARChive) files by seeking to specific offsets and parsing individual WARC records.

Description

The WarcReaderFast class extends BinaryReaderThreaded to implement WARC-specific media reading. Each media object is expected to carry a path (the WARC file location) and an offset (the byte position of the target WARC record within that file). The reader seeks directly to the specified offset rather than scanning the entire archive, enabling efficient random access to individual records.

The implementation uses thread-local storage to manage file pointers across concurrent reads. Each worker thread keeps its own open file pointer together with the name of the file it belongs to. When the next read targets the same WARC file, the existing pointer is reused, which avoids the overhead of repeatedly opening and closing files. When the target file changes, the old pointer is closed before the new file is opened.
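The per-thread reuse described above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual code: the helper names `get_handle` and `read_at` and the attribute names on the thread-local object are hypothetical, and plain local files stand in for whatever the data folder provides.

```python
import threading

# Hypothetical per-thread state; each thread sees its own `fp` and `path`.
_local = threading.local()


def get_handle(path: str):
    """Return this thread's open handle for `path`, reopening only on a file change."""
    if getattr(_local, "path", None) != path:
        old = getattr(_local, "fp", None)
        if old is not None:
            old.close()  # close the previous file before switching
        _local.fp = open(path, "rb")
        _local.path = path
    return _local.fp


def read_at(path: str, offset: int, size: int) -> bytes:
    """Seek to `offset` in the cached handle and read up to `size` bytes."""
    fp = get_handle(path)
    fp.seek(offset)
    return fp.read(size)
```

Because the cache lives in `threading.local()`, two workers reading the same file never share a file position, so no lock is needed around `seek` and `read`. The trade-off is one open handle per thread.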

After seeking to the offset, the reader creates a new ArchiveIterator instance from the warcio library, with a block size taken from the media's length attribute (defaulting to 128 KB when no length is set). It reads exactly one WARC record and returns the bytes of its content stream. Sizing the block to the record matters: when the record length is known in advance, the reader can fetch about one record's worth of data per read instead of over-fetching bytes that belong to later records.
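The seek-then-read-one-record pattern can be illustrated without warcio. The sketch below is a standard-library stand-in, not the reader's implementation: it assumes each record is stored as its own gzip member (the usual convention for compressed WARC files) and that the length hint is the compressed size of that member. The helpers `build_archive` and `read_member_at` are hypothetical names.

```python
import gzip
import io
import zlib

DEFAULT_BLOCK_SIZE = 128 * 1024  # mirrors the 128 KB default described above


def build_archive(payloads):
    """Concatenate one gzip member per payload; return the buffer and (offset, length) pairs."""
    buf = io.BytesIO()
    index = []
    for payload in payloads:
        member = gzip.compress(payload)
        index.append((buf.tell(), len(member)))
        buf.write(member)
    return buf, index


def read_member_at(fp, offset, length=None):
    """Decompress only the gzip member that starts at `offset`."""
    block_size = length if length else DEFAULT_BLOCK_SIZE
    fp.seek(offset)
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 expects a gzip header and trailer
    parts = []
    while not decomp.eof:  # eof becomes True at the end of this member
        chunk = fp.read(block_size)
        if not chunk:
            break
        parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

With an exact length hint the loop finishes after a single read. Without one, the default block may pull in bytes from following members, which the decompressor leaves unused. That is the over-fetching the length hint is meant to avoid.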

Usage

Use WarcReaderFast when your pipeline needs to read media content stored in WARC archives, such as images or other binary resources crawled from the web. It is designed for use cases where media metadata (including file paths and byte offsets) has been pre-computed and stored alongside document records.

Code Reference

Source Location

Signature

class WarcReaderFast(BinaryReaderThreaded):
    type = "Media Reader"
    name = "🌐 - Warc Reader Fast"

    def read_media_record(self, media: Media):
        ...

Import

from datatrove.pipeline.media.media_readers.warc import WarcReaderFast

I/O Contract

Inputs

Name | Type | Required | Description
data_folder | DataFolderLike | Yes | Path to the folder containing WARC files (inherited from BinaryReaderThreaded)
workers | int | No | Number of worker threads (inherited, default: 1)
preserve_order | bool | No | Whether to preserve document order (inherited, default: False)
media.path | str | Yes | Path to the WARC file containing this media record
media.offset | int | Yes | Byte offset of the WARC record within the file
media.length | int | No | Length hint for buffer sizing (default: 128 KB)

Outputs

Name | Type | Description
content | bytes or None | The raw content bytes extracted from the WARC record, or None if the offset or path is missing
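The contract in the tables above can be expressed as a small guard. This is only a sketch under stated assumptions: `MediaRef` is a hypothetical stand-in for the media object (not datatrove's `Media` class), `plan_read` is an illustrative helper, and treating "missing" as `None` (so that offset 0 remains valid) is an assumption rather than confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DEFAULT_BLOCK_SIZE = 128 * 1024  # 128 KB fallback when no length hint is given


@dataclass
class MediaRef:
    """Hypothetical stand-in carrying the three fields the reader relies on."""
    path: Optional[str] = None
    offset: Optional[int] = None
    length: Optional[int] = None


def plan_read(media: MediaRef) -> Optional[Tuple[str, int, int]]:
    """Return (path, offset, block_size), or None when path or offset is missing."""
    # Compare against None explicitly: offset 0 is the first record, not "missing".
    if media.path is None or media.offset is None:
        return None
    block_size = media.length if media.length else DEFAULT_BLOCK_SIZE
    return media.path, media.offset, block_size
```

A caller would skip the read and yield `None` content whenever `plan_read` returns `None`, matching the output row above.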

Usage Examples

Basic Usage

from datatrove.pipeline.media.media_readers.warc import WarcReaderFast

# Read media from WARC files with 8 worker threads
warc_reader = WarcReaderFast(
    data_folder="s3://my-bucket/warc-files/",
    workers=8,
    preserve_order=False,
)

# Use in a pipeline after a document reader that provides media offsets
pipeline = [
    # ... document reader that populates media.path and media.offset ...
    warc_reader,
    # ... downstream processing ...
]
