Principle:Eventual Inc Daft Data Preprocessing Download
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Preprocessing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for downloading binary content from URLs within a distributed dataframe pipeline.
Description
URL downloading enables fetching remote resources (images, files, API responses) as part of a data pipeline. Daft provides a built-in download function that treats each string in a column as a URL and retrieves the binary content in parallel.
Key design aspects include:
- Parallel downloads: Daft uses a configurable number of concurrent connections (default 32) per thread to maximize throughput when downloading many URLs.
- Adaptive runtime: For local execution, Daft uses a multi-threaded Tokio runtime to maximize parallelism. For distributed execution on Ray, it uses a single-threaded runtime per worker to avoid overwhelming storage backends with `N_CPU * N_PROC * max_connections` simultaneous connections.
- Error handling: Two modes are available: `"raise"` fails immediately on any download error, while `"null"` logs the error and returns a null value, enabling resilient pipelines that tolerate missing resources.
- Connection pooling: The underlying Rust implementation manages connection pools efficiently, reusing connections across downloads to the same host.
- Unity Catalog integration: If a Unity Catalog session is active, the download function automatically configures its IO settings to support Unity-managed storage paths.
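To make the connection-count concern concrete, the following is a minimal arithmetic sketch rather than Daft code. The function name `worst_case_connections` is hypothetical, and it assumes that a multi-threaded runtime runs one I/O thread per CPU while a single-threaded runtime runs exactly one per process.

```python
def worst_case_connections(
    n_cpu: int,
    n_proc: int,
    max_connections: int = 32,
    multi_threaded: bool = True,
) -> int:
    """Upper bound on simultaneous connections opened from one machine.

    Assumption: a multi-threaded runtime uses one I/O thread per CPU,
    a single-threaded runtime uses one I/O thread per process.
    """
    io_threads_per_proc = n_cpu if multi_threaded else 1
    return io_threads_per_proc * n_proc * max_connections


# Hypothetical node: 16 CPUs, 16 worker processes, default limit of 32.
multi = worst_case_connections(n_cpu=16, n_proc=16)
single = worst_case_connections(n_cpu=16, n_proc=16, multi_threaded=False)
print(multi, single)  # 8192 512
```

Under these assumptions, the single-threaded runtime reduces the worst case by a factor of `N_CPU`, which is the motivation the description gives for the Ray behaviour.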
Usage
Use this technique when you need to fetch binary content from URLs stored in a DataFrame column. Common use cases include:
- Downloading images from URL columns in ML datasets
- Fetching API responses as part of an enrichment pipeline
- Retrieving files from cloud storage paths stored in a metadata table
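A typical call might look like the sketch below. It assumes a recent Daft release that exposes `daft.functions.download` with `max_connections` and `on_error` keyword arguments (older releases offer the same options through `.url.download(...)` on a column expression). The helper name `with_downloaded_bytes` and the column names `url` and `data` are hypothetical.

```python
def with_downloaded_bytes(urls, on_error="null", max_connections=32):
    """Build a lazy DataFrame with a binary `data` column next to `url`."""
    if on_error not in ("raise", "null"):
        raise ValueError(f"on_error must be 'raise' or 'null', got {on_error!r}")

    # Imported lazily so the argument check above works without Daft installed.
    import daft
    from daft.functions import download  # assumed location of the function

    df = daft.from_pydict({"url": list(urls)})
    # Nothing is fetched here: Daft DataFrames are lazy until collected.
    return df.with_column(
        "data",
        download(df["url"], max_connections=max_connections, on_error=on_error),
    )
```

Calling `.collect()` on the returned DataFrame triggers the downloads. With `on_error="null"`, rows whose URL cannot be fetched should carry a null in `data` instead of aborting the query.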
Theoretical Basis
The download operation follows the pattern of parallel HTTP downloads over pooled connections:
- Column-level parallelism: Each partition of the DataFrame processes its URL column independently, with multiple concurrent connections per partition.
- Backpressure control: The `max_connections` parameter acts as a concurrency limiter, preventing resource exhaustion when downloading from rate-limited servers.
- Error resilience: The `on_error` parameter implements per-row failure isolation: with `"null"`, an individual failure is recorded as a null value rather than failing the entire pipeline.
- Adaptive concurrency: The runtime automatically adjusts between multi-threaded and single-threaded I/O based on the execution environment to balance throughput and resource consumption.
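The concurrency-limiter idea can be illustrated with a semaphore. This is a sketch in Python's `asyncio`, not Daft's Rust implementation. `fake_fetch` is a hypothetical stand-in that sleeps instead of touching the network, and the `peak` counter exists only to show that the limit holds.

```python
import asyncio


async def fake_fetch(url: str) -> bytes:
    """Hypothetical stand-in for an HTTP GET; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return url.encode()


async def download_all(urls, max_connections=32):
    """Fetch every URL while keeping at most `max_connections` in flight."""
    limiter = asyncio.Semaphore(max_connections)
    in_flight = 0
    peak = 0

    async def fetch_one(url):
        nonlocal in_flight, peak
        async with limiter:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_fetch(url)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return results, peak


urls = [f"https://example.com/{i}" for i in range(20)]
results, peak = asyncio.run(download_all(urls, max_connections=4))
print(len(results), peak <= 4)  # 20 True
```

All twenty requests are scheduled at once, but the semaphore ensures that no more than four are ever active together, which is the behaviour the `max_connections` description implies.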
Pseudocode:
1. For each partition in the DataFrame:
   a. Initialize connection pool with max_connections limit
   b. For each URL in the column:
      - Issue async HTTP GET request
      - On success: store binary response body
      - On error:
        * If on_error="raise": propagate error
        * If on_error="null": store null, log warning
   c. Return Binary column with downloaded bytes
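The steps above can be approximated for a single partition as follows. This sketch swaps the async runtime for a thread pool to stay short, and it takes the fetch function as a parameter so it can run without network access. `download_partition` and `fake_fetch` are hypothetical names, not part of Daft.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

logger = logging.getLogger("download_sketch")


def download_partition(
    urls: List[str],
    fetch: Callable[[str], bytes],
    max_connections: int = 32,
    on_error: str = "raise",
) -> List[Optional[bytes]]:
    """Download one partition's URLs, returning bytes (or None) per row."""
    if on_error not in ("raise", "null"):
        raise ValueError(f"on_error must be 'raise' or 'null', got {on_error!r}")

    def fetch_one(url: str) -> Optional[bytes]:
        try:
            return fetch(url)
        except Exception as exc:
            if on_error == "raise":
                raise  # propagate: the whole partition fails
            logger.warning("download failed for %s: %s", url, exc)
            return None  # isolate the failure as a null value

    # The pool size plays the role of the max_connections limit.
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch_one, urls))


def fake_fetch(url: str) -> bytes:
    """Hypothetical fetcher: fails for any URL containing 'missing'."""
    if "missing" in url:
        raise IOError(f"404 for {url}")
    return url.encode()


urls = ["https://example.com/a.png", "https://example.com/missing.png"]
print(download_partition(urls, fake_fetch, on_error="null"))
# [b'https://example.com/a.png', None]
```

With `on_error="raise"`, the same input raises the fetcher's exception when the results are collected, so the caller sees a failed partition instead of a partially filled column.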