Implementation:Huggingface Datatrove HttpMediaFetcher

Knowledge Sources	Huggingface_Datatrove
Domains	Media Processing, Web Scraping
Last Updated	2026-02-14 17:00 GMT

Overview

HTTPFetchReader is a pipeline step that downloads media content from URLs over HTTP/HTTPS. It fetches concurrently on a pool of worker threads and supports retries, robots.txt compliance, and configurable timeouts.

Description

HTTPFetchReader processes documents in the pipeline by fetching the media referenced in each document's media list. It uses a ThreadPoolExecutor with a configurable number of workers (default 10) to download media concurrently.
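The concurrent download pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the datatrove implementation: fetch_one and fetch_all are hypothetical names, and fetch_one returns placeholder bytes instead of performing a real request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional


def fetch_one(url: str) -> bytes:
    """Hypothetical stand-in for a real download; returns placeholder bytes."""
    return f"content of {url}".encode()


def fetch_all(urls: list, workers: int = 10) -> dict:
    """Fetch every URL on a thread pool, mapping each URL to its bytes (None on failure)."""
    results: dict = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit one task per URL and remember which future belongs to which URL.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                content: Optional[bytes] = future.result()
            except Exception:
                # One failed download should not abort the whole batch.
                content = None
            results[url] = content
    return results
```

Collecting results with as_completed lets fast downloads be handled while slow ones are still in flight, which is the main benefit of a worker pool for I/O-bound fetching.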

Each worker thread maintains its own requests.Session with a custom SSL context that accepts all cipher suites (including weak ones) and disables certificate verification, trading transport security for compatibility with diverse web servers. The CustomHTTPAdapter configures connection pooling through the pool_size and pool_connections parameters.
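The per-thread state idea can be illustrated with threading.local. The sketch below is an assumption-laden simplification: it keeps only a standard-library ssl.SSLContext per thread rather than a requests.Session with a CustomHTTPAdapter, it does not relax the cipher list, and the helper names are hypothetical.

```python
import ssl
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def permissive_ssl_context() -> ssl.SSLContext:
    """Build a context that skips certificate and hostname verification."""
    ctx = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be relaxed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def get_thread_context() -> ssl.SSLContext:
    """Return the calling thread's context, creating it on first use."""
    if not hasattr(_local, "ctx"):
        _local.ctx = permissive_ssl_context()
    return _local.ctx
```

Disabling verification removes protection against impersonated servers, so a configuration like this is only appropriate when the fetched content is treated as untrusted anyway.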

The fetch_media method retries failed requests with exponential backoff (up to max_retries attempts, default 3), checks robots.txt compliance before fetching, enforces a per-download timeout, and caps downloaded content at a configurable max_size (default 1024 * 1024 * 1024 bytes, i.e. 1 GiB). On SSL errors, it automatically falls back from HTTPS to HTTP. The method handles several failure modes: connection errors, request timeouts, download timeouts, and general exceptions.
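Two of the behaviours above, the robots.txt check and the backoff loop, can be sketched as follows. This is not the code from http_fetch.py: the function names are hypothetical, the delay formula retry_delay * 2**attempt is an assumption about how the base delay is scaled, and the fetch callable is supplied by the caller.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def fetch_with_backoff(fetch, url, max_retries=3, retry_delay=2, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each failed attempt."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            # Sleep only if another attempt will follow (assumed formula).
            if attempt < max_retries - 1:
                sleep(retry_delay * 2**attempt)
    raise last_error
```

Passing sleep as a parameter keeps the loop testable without real waiting; with the defaults, the waits between the three attempts are 2 and 4 seconds.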

Results are processed via process_record_result, which updates each media object with the fetched bytes and response metadata, tracking statistics for success, failure, timeout, truncation, and robots.txt disallowance. The module also supports optional custom DNS resolution via the dnspython library for environments with non-standard DNS configurations.
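The result-handling step can be pictured roughly as below. Everything here is illustrative: MediaStub is a hypothetical stand-in for the media object, record_result is not the signature of process_record_result, and the statistic names and the choice to count a truncated download as a success are assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MediaStub:
    """Hypothetical stand-in for a media object; not datatrove's class."""
    url: str
    media_bytes: Optional[bytes] = None
    metadata: dict = field(default_factory=dict)


def record_result(media, content, response_meta, stats, max_size):
    """Attach fetched bytes and metadata to media and tally the outcome."""
    media.metadata.update(response_meta)
    if content is None:
        # Failed fetches carry a reason such as "timeout" or "robots_disallowed".
        stats[response_meta.get("reason", "failed")] += 1
        return
    if len(content) > max_size:
        # Keep only the first max_size bytes and note the truncation.
        content = content[:max_size]
        stats["truncated"] += 1
    media.media_bytes = content
    stats["success"] += 1
```

A Counter keeps the bookkeeping compact, since unseen keys start at zero and each outcome is a single increment.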

Usage

Use HTTPFetchReader when processing documents that contain references to media (images, files) that need to be downloaded, such as when enriching web-crawled data with actual media content.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/media/readers/http_fetch.py
Lines: 1-356

Signature

class HTTPFetchReader(PipelineStep):
    def __init__(
        self,
        retry_codes: list[int] = [403, 408, 429, 500, 502, 503, 504],
        timeout: tuple[int, int] = (60, 600),
        workers: int = 10,
        retry_delay: int = 2,
        max_retries: int = 3,
        download_timeout: int = 10,
        max_size: int = 1024 * 1024 * 1024,
        dns_port: int | None = None,
        pool_size: int = 5,
        pool_connections: int = 5,
        custom_agent: str = "HF-Research/1.0",
    ):

Import

from datatrove.pipeline.media.readers.http_fetch import HTTPFetchReader

I/O Contract

Inputs

Name	Type	Required	Description
retry_codes	list[int]	No	HTTP status codes to retry on (default: [403, 408, 429, 500, 502, 503, 504])
timeout	tuple[int, int]	No	Connection and read timeout in seconds (default: (60, 600))
workers	int	No	Number of concurrent download threads (default: 10)
retry_delay	int	No	Base delay in seconds for exponential backoff (default: 2)
max_retries	int	No	Maximum number of retry attempts per URL (default: 3)
download_timeout	int	No	Maximum seconds for a single download (default: 10)
max_size	int	No	Maximum download size in bytes (default: 1024 * 1024 * 1024, i.e. 1 GiB)
dns_port	int | None	No	Custom DNS resolver port (default: None, which uses the system default)
pool_size	int	No	Connection pool max size per adapter (default: 5)
pool_connections	int	No	Number of pool connections per adapter (default: 5)
custom_agent	str	No	User-Agent string for HTTP requests (default: "HF-Research/1.0")

Outputs

Name	Type	Description
Documents	DocumentsPipeline	Documents with media objects populated with fetched bytes and metadata

Usage Examples

Basic Usage

from datatrove.pipeline.media.readers.http_fetch import HTTPFetchReader

fetcher = HTTPFetchReader(
    workers=20,
    max_retries=3,
    timeout=(30, 300),
    max_size=50 * 1024 * 1024,  # cap each download at 50 MiB
    custom_agent="MyBot/1.0",
)

# Use in a pipeline
pipeline = [
    # ... reader step that produces documents with media URLs ...
    fetcher,
    # ... further processing ...
]

Related Pages

Principle:Huggingface_Datatrove_HTTP_Media_Fetching
