Implementation:Huggingface Datatrove HttpMediaFetcher
| Knowledge Sources | |
|---|---|
| Domains | Media Processing, Web Scraping |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
HTTPFetchReader is a pipeline step that downloads media content from URLs via HTTP/HTTPS using multi-threaded concurrent fetching with retry logic, robots.txt compliance, and configurable timeouts.
Description
HTTPFetchReader processes documents in the pipeline by fetching the media referenced in each document's media list. It uses a ThreadPoolExecutor with a configurable number of workers (default 10) to download media concurrently.
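The concurrency pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: `fetch_all` and `fetch_one` are hypothetical names, and `fetch_one` stands in for whatever function downloads a single URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, workers=10):
    """Run fetch_one(url) for each URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while fetching it (hypothetical helper, for illustration only).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # A failed download must not abort the other downloads.
                results[url] = exc
    return results
```

Collecting exceptions per URL, rather than letting the first one propagate, mirrors the idea that one bad media link should not stop the rest of the batch.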
Each worker thread maintains its own requests.Session with a custom SSL context that accepts all cipher suites (including weak ones) and disables certificate verification. This trades transport security for compatibility with diverse web servers. CustomHTTPAdapter configures connection pooling through the pool_size and pool_connections parameters.
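One common way to give each worker thread its own session is `threading.local`. The sketch below shows only that pattern and assumes nothing about how the module does it: `get_session` is a hypothetical helper, and `factory` stands in for whatever callable builds a configured `requests.Session`.

```python
import threading

# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_session(factory):
    """Return the calling thread's session, creating it on first use.

    `factory` is any zero-argument callable; in a real fetcher it would
    build and configure a requests.Session (assumption for illustration).
    """
    session = getattr(_local, "session", None)
    if session is None:
        session = factory()
        _local.session = session
    return session
```

Because sessions are never shared between threads, no locking is needed around connection state, and each thread reuses its own pooled connections across downloads.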
The fetch_media method implements exponential backoff retries (up to max_retries attempts, default 3), checks robots.txt compliance before fetching, enforces per-download timeouts, and caps downloaded content at a configurable max_size (default 1GB). On SSL errors, it automatically falls back from HTTPS to HTTP. The method handles various failure modes including connection errors, request timeouts, download timeouts, and general exceptions.
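Two of the mechanisms above, exponential backoff and the size cap, can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the delay formula `retry_delay * 2 ** attempt` is assumed, and `fetch_with_backoff`, `read_capped`, and `fetch_once` are hypothetical names.

```python
import time


def fetch_with_backoff(fetch_once, max_retries=3, retry_delay=2, sleep=time.sleep):
    """Call fetch_once() up to max_retries times, doubling the wait each time.

    Assumed delay schedule: retry_delay, 2 * retry_delay, 4 * retry_delay, ...
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch_once()
        except Exception as exc:
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < max_retries - 1:
                sleep(retry_delay * 2 ** attempt)
    raise last_error


def read_capped(chunks, max_size):
    """Concatenate byte chunks, stopping once max_size bytes are collected.

    Returns (data, truncated), where truncated is True if input was cut off.
    """
    buf = bytearray()
    for chunk in chunks:
        remaining = max_size - len(buf)
        if len(chunk) > remaining:
            buf.extend(chunk[:remaining])
            return bytes(buf), True
        buf.extend(chunk)
    return bytes(buf), False
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting. Reading in chunks means an oversized response is abandoned as soon as the cap is reached instead of being held fully in memory.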
Results are processed via process_record_result, which updates each media object with the fetched bytes and response metadata, tracking statistics for success, failure, timeout, truncation, and robots.txt disallowance. The module also supports optional custom DNS resolution via the dnspython library for environments with non-standard DNS configurations.
Usage
Use HTTPFetchReader when processing documents that contain references to media (images, files) that need to be downloaded, such as when enriching web-crawled data with actual media content.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/media/readers/http_fetch.py
- Lines: 1-356
Signature
class HTTPFetchReader(PipelineStep):
    def __init__(
        self,
        retry_codes: list[int] = [403, 408, 429, 500, 502, 503, 504],
        timeout: tuple[int, int] = (60, 600),
        workers: int = 10,
        retry_delay: int = 2,
        max_retries: int = 3,
        download_timeout: int = 10,
        max_size: int = 1024 * 1024 * 1024,
        dns_port: int | None = None,
        pool_size: int = 5,
        pool_connections: int = 5,
        custom_agent: str = "HF-Research/1.0",
    ):
Import
from datatrove.pipeline.media.readers.http_fetch import HTTPFetchReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| retry_codes | list[int] | No | HTTP status codes to retry on (default: [403, 408, 429, 500, 502, 503, 504]) |
| timeout | tuple[int, int] | No | (connect, read) timeouts in seconds (default: (60, 600)) |
| workers | int | No | Number of concurrent download threads (default: 10) |
| retry_delay | int | No | Base delay in seconds for exponential backoff (default: 2) |
| max_retries | int | No | Maximum number of retry attempts per URL (default: 3) |
| download_timeout | int | No | Maximum seconds for a single download (default: 10) |
| max_size | int | No | Maximum download size in bytes (default: 1024 * 1024 * 1024, i.e. 1 GiB) |
| dns_port | int \| None | No | Custom DNS resolver port (default: None, which uses the system default) |
| pool_size | int | No | Connection pool max size per adapter (default: 5) |
| pool_connections | int | No | Number of pool connections per adapter (default: 5) |
| custom_agent | str | No | User-Agent string for HTTP requests (default: "HF-Research/1.0") |
Outputs
| Name | Type | Description |
|---|---|---|
| Documents | DocumentsPipeline | Documents with media objects populated with fetched bytes and metadata |
Usage Examples
Basic Usage
from datatrove.pipeline.media.readers.http_fetch import HTTPFetchReader

fetcher = HTTPFetchReader(
    workers=20,
    max_retries=3,
    timeout=(30, 300),
    max_size=50 * 1024 * 1024,  # 50MB max
    custom_agent="MyBot/1.0",
)

# Use in a pipeline
pipeline = [
    # ... reader step that produces documents with media URLs ...
    fetcher,
    # ... further processing ...
]