Implementation:Huggingface Datatrove HttpMediaFetcher

From Leeroopedia
Knowledge Sources
Domains Media Processing, Web Scraping
Last Updated 2026-02-14 17:00 GMT

Overview

HTTPFetchReader is a pipeline step that downloads media content from URLs over HTTP/HTTPS. It fetches concurrently from a thread pool and applies retry logic, robots.txt checks, and configurable timeouts.

Description

HTTPFetchReader processes documents in the pipeline by fetching the media referenced in each document's media list. It uses a ThreadPoolExecutor with a configurable number of workers (default 10) to download media concurrently.
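
The concurrent download pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the datatrove implementation: `fetch_one` and `fetch_all` are hypothetical names, and `fetch_one` returns fake bytes instead of performing a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_one(url: str) -> bytes:
    # Hypothetical stand-in for the real HTTP download; returns fake bytes.
    return f"payload for {url}".encode()


def fetch_all(urls: list[str], workers: int = 10) -> dict[str, object]:
    """Fetch every URL concurrently; map each URL to its bytes or to the error raised."""
    results: dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit one task per URL and remember which future belongs to which URL.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed download should not abort the batch
                results[url] = exc
    return results


results = fetch_all(
    ["https://example.com/a.png", "https://example.com/b.png"], workers=2
)
```

Collecting results with `as_completed` lets fast downloads be handled while slow ones are still in flight, which is one plausible reason to prefer it over `pool.map` here.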

Each worker thread maintains its own requests.Session with a custom SSL context that accepts all cipher suites (including weak ones) and disables certificate verification, trading security for compatibility with diverse web servers. A CustomHTTPAdapter configures connection pooling through the pool_size and pool_connections parameters.
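
The per-thread state and the permissive SSL settings can be illustrated with the standard library. The sketch below is an assumption-laden illustration: the names `permissive_ssl_context` and `get_thread_context` are hypothetical, the cipher string `"ALL"` is a guess (this page does not show the one datatrove uses), and it stores a bare `ssl.SSLContext` per thread rather than a full `requests.Session`.

```python
import ssl
import threading

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def permissive_ssl_context() -> ssl.SSLContext:
    """Build a context that skips certificate and hostname checks (insecure by design)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE
    ctx.set_ciphers("ALL")  # assumed cipher string; widens the accepted cipher list
    return ctx


def get_thread_context() -> ssl.SSLContext:
    """Return the calling thread's context, creating it on first use."""
    if not hasattr(_thread_state, "ctx"):
        _thread_state.ctx = permissive_ssl_context()
    return _thread_state.ctx
```

With requests, a context like this is typically handed to an `HTTPAdapter` subclass that passes it as `ssl_context` when building its pool manager. Because verification is disabled, the downloaded bytes should be treated as untrusted.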

The fetch_media method checks robots.txt compliance before fetching, retries with exponential backoff (up to max_retries attempts, default 3), enforces a per-download timeout (download_timeout), and caps downloaded content at a configurable max_size (default 1 GiB, i.e. 1024 * 1024 * 1024 bytes). On SSL errors, it automatically falls back from HTTPS to HTTP. The method handles several failure modes: connection errors, request timeouts, download timeouts, and general exceptions.
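
The robots.txt check and the backoff loop can be sketched as follows. This is an illustration under stated assumptions, not the code in fetch_media: `is_allowed` and `fetch_with_retries` are hypothetical helpers, the delay formula `retry_delay * 2 ** attempt` is one common form of exponential backoff (the exact formula datatrove uses is not given on this page), and `do_request` is a caller-supplied function returning `(status, body)`.

```python
import time
from urllib.robotparser import RobotFileParser

# Mirrors the default retry_codes from the signature on this page.
RETRY_CODES = {403, 408, 429, 500, 502, 503, 504}


def is_allowed(robots_lines: list[str], agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


def fetch_with_retries(do_request, max_retries=3, retry_delay=2, sleep=time.sleep):
    """Call do_request() until the status is not retryable or attempts run out."""
    status, body = None, b""
    for attempt in range(max_retries):
        status, body = do_request()
        if status not in RETRY_CODES:
            return status, body
        if attempt < max_retries - 1:
            # Assumed backoff schedule: 2s, 4s, 8s, ... for retry_delay=2.
            sleep(retry_delay * 2 ** attempt)
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A fuller version would also catch connection and SSL exceptions inside the loop, which is where an HTTPS to HTTP fallback would live.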

Results are processed via process_record_result, which updates each media object with the fetched bytes and response metadata, tracking statistics for success, failure, timeout, truncation, and robots.txt disallowance. The module also supports optional custom DNS resolution via the dnspython library for environments with non-standard DNS configurations.
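
The result-handling step can be pictured as copying the fetch outcome onto the media object and incrementing a counter per outcome. The sketch below is hypothetical throughout: `MediaStub`, `record_result`, the outcome labels, and the metadata keys are illustrative assumptions, not datatrove's actual media class or the body of process_record_result.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MediaStub:
    """Hypothetical stand-in for a datatrove media object."""
    url: str
    media_bytes: Optional[bytes] = None
    metadata: dict = field(default_factory=dict)


def record_result(
    media: MediaStub,
    outcome: str,
    body: Optional[bytes],
    response_meta: dict,
    stats: Counter,
) -> None:
    """Attach a fetch result to the media object and count its outcome."""
    stats[outcome] += 1
    media.metadata.update(response_meta)
    # Only outcomes that produced usable bytes populate the media payload.
    if outcome in ("success", "truncated"):
        media.media_bytes = body


stats = Counter()
ok = MediaStub(url="https://example.com/a.png")
blocked = MediaStub(url="https://example.com/private/b.png")
record_result(ok, "success", b"\x89PNG", {"status_code": 200}, stats)
record_result(blocked, "robots_disallowed", None, {"reason": "robots.txt"}, stats)
```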

Usage

Use HTTPFetchReader when processing documents that contain references to media (images, files) that need to be downloaded, such as when enriching web-crawled data with actual media content.

Code Reference

Source Location

Signature

class HTTPFetchReader(PipelineStep):
    def __init__(
        self,
        retry_codes: list[int] = [403, 408, 429, 500, 502, 503, 504],
        timeout: tuple[int, int] = (60, 600),
        workers: int = 10,
        retry_delay: int = 2,
        max_retries: int = 3,
        download_timeout: int = 10,
        max_size: int = 1024 * 1024 * 1024,
        dns_port: int | None = None,
        pool_size: int = 5,
        pool_connections: int = 5,
        custom_agent: str = "HF-Research/1.0",
    ):

Import

from datatrove.pipeline.media.readers.http_fetch import HTTPFetchReader

I/O Contract

Inputs

Name | Type | Required | Description
retry_codes | list[int] | No | HTTP status codes to retry on (default: [403, 408, 429, 500, 502, 503, 504])
timeout | tuple[int, int] | No | (connect, read) timeouts in seconds (default: (60, 600))
workers | int | No | Number of concurrent download threads (default: 10)
retry_delay | int | No | Base delay in seconds for exponential backoff (default: 2)
max_retries | int | No | Maximum number of retry attempts per URL (default: 3)
download_timeout | int | No | Maximum seconds for a single download (default: 10)
max_size | int | No | Maximum download size in bytes (default: 1024 * 1024 * 1024, i.e. 1 GiB)
dns_port | int or None | No | Custom DNS resolver port (default: None, which uses the system resolver)
pool_size | int | No | Connection pool max size per adapter (default: 5)
pool_connections | int | No | Number of pool connections per adapter (default: 5)
custom_agent | str | No | User-Agent string for HTTP requests (default: "HF-Research/1.0")
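
To make the interaction of max_size and download_timeout concrete, here is a minimal sketch of a chunked read that stops at a byte cap or a deadline. The function name `read_capped`, the chunk size, and the choice to keep the first max_size bytes of an oversized download are assumptions for illustration; the page does not say what datatrove does with truncated content.

```python
import io
import time


def read_capped(stream, max_size, download_timeout, chunk_size=8192, clock=time.monotonic):
    """Read until EOF, the size cap, or the deadline. Returns (data, truncated)."""
    deadline = clock() + download_timeout
    buf = bytearray()
    while True:
        if clock() > deadline:
            raise TimeoutError("download exceeded download_timeout")
        # Ask for at most one byte past the cap so oversize content is detectable.
        chunk = stream.read(min(chunk_size, max_size - len(buf) + 1))
        if not chunk:
            return bytes(buf), False
        buf.extend(chunk)
        if len(buf) > max_size:
            return bytes(buf[:max_size]), True


data, truncated = read_capped(
    io.BytesIO(b"x" * 25), max_size=10, download_timeout=10, chunk_size=8
)
```

The deadline here bounds the whole download, which is different from the read value in timeout: a read timeout only limits the wait between received chunks, so a slow but steady server could otherwise keep a worker busy indefinitely.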

Outputs

Name | Type | Description
Documents | DocumentsPipeline | Documents with media objects populated with fetched bytes and metadata

Usage Examples

Basic Usage

from datatrove.pipeline.media.readers.http_fetch import HTTPFetchReader

fetcher = HTTPFetchReader(
    workers=20,
    max_retries=3,
    timeout=(30, 300),
    max_size=50 * 1024 * 1024,  # 50MB max
    custom_agent="MyBot/1.0",
)

# Use in a pipeline
pipeline = [
    # ... reader step that produces documents with media URLs ...
    fetcher,
    # ... further processing ...
]
