Implementation:Datajuicer Data juicer DownloadFileMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool provided by Data-Juicer for downloading files from URLs to local disk or loading them into memory.
Description
DownloadFileMapper is a mapper operator that downloads files from URLs stored in a dataset field and either saves them to local disk or loads their contents into memory. Downloads run asynchronously through aiohttp, with concurrency capped by max_concurrent and a configurable per-download timeout. Nested lists of URLs are flattened for batch downloading, and the original structure is reconstructed afterward. With resume_download enabled, files that have already been downloaded are skipped. Results are written to the directory given by save_dir or stored as raw bytes in the sample field named by save_field (falling back to image_bytes when neither is set). The operator runs in batched mode and extends the Mapper base class.
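The flatten, download, and reconstruct flow described above can be sketched with the standard library alone. This is a minimal illustration, not the operator's actual code: the helpers `flatten`, `restore`, `fetch_all`, and `download_nested` are hypothetical names, `fake_fetch` stands in for a real aiohttp request, and returning `None` for a failed download is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Tuple


def flatten(nested: List[Any]) -> Tuple[List[str], Any]:
    """Flatten nested URL lists; return the flat list and a shape template.

    The template mirrors the input, with each URL replaced by its index
    in the flat list.
    """
    flat: List[str] = []

    def walk(node: Any) -> Any:
        if isinstance(node, list):
            return [walk(child) for child in node]
        flat.append(node)
        return len(flat) - 1

    shape = walk(nested)
    return flat, shape


def restore(shape: Any, results: List[Any]) -> Any:
    """Rebuild the original nesting, substituting results for indices."""
    if isinstance(shape, list):
        return [restore(child, results) for child in shape]
    return results[shape]


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[bytes]],
    max_concurrent: int = 10,
    timeout: float = 30,
) -> List[Optional[bytes]]:
    """Run fetch over all URLs with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> Optional[bytes]:
        async with semaphore:
            try:
                return await asyncio.wait_for(fetch(url), timeout)
            except Exception:
                # Assumption: a failed or timed-out download yields None.
                return None

    return list(await asyncio.gather(*(bounded(u) for u in urls)))


async def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP request; echoes the URL as bytes."""
    await asyncio.sleep(0)
    return url.encode()


def download_nested(nested, fetch=fake_fetch, max_concurrent=10, timeout=30):
    """Flatten, download concurrently, then restore the nesting."""
    flat, shape = flatten(nested)
    results = asyncio.run(fetch_all(flat, fetch, max_concurrent, timeout))
    return restore(shape, results)
```

Storing indices in the shape template keeps reconstruction independent of download completion order, since `asyncio.gather` returns results in the order the tasks were submitted.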
Usage
Import this operator when you need to download images, audio, or other media files from URLs before processing them with downstream operators.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/download_file_mapper.py
Signature
@OPERATORS.register_module("download_file_mapper")
class DownloadFileMapper(Mapper):
    def __init__(self,
                 download_field: str = None,
                 save_dir: str = None,
                 save_field: str = None,
                 resume_download: bool = False,
                 timeout: int = 30,
                 max_concurrent: int = 10,
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.download_file_mapper import DownloadFileMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| download_field | str | No | Field name containing URLs to download. Default: None |
| save_dir | str | No | Directory to save downloaded files. Default: None |
| save_field | str | No | Field name to save downloaded file content. Default: None (falls back to image_bytes if save_dir is also None) |
| resume_download | bool | No | Whether to resume download and skip already-existing files. Default: False |
| timeout | int | No | Timeout in seconds for each download. Default: 30 |
| max_concurrent | int | No | Maximum number of concurrent downloads. Default: 10 |
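To make the effect of resume_download concrete, here is one way the skip decision could be expressed. It is a sketch under stated assumptions rather than the operator's implementation: `local_path_for` and `plan_downloads` are hypothetical helpers, and naming each local file after the last segment of the URL path is an assumption made only for illustration.

```python
import os
from typing import Dict, List
from urllib.parse import urlparse


def local_path_for(url: str, save_dir: str) -> str:
    """Map a URL to a path inside save_dir (naming scheme is assumed)."""
    name = os.path.basename(urlparse(url).path) or "download"
    return os.path.join(save_dir, name)


def plan_downloads(
    urls: List[str], save_dir: str, resume_download: bool = False
) -> Dict[str, str]:
    """Return {url: target_path} for the URLs that still need fetching.

    With resume_download=True, URLs whose target file already exists
    are left out of the plan.
    """
    to_fetch: Dict[str, str] = {}
    for url in urls:
        path = local_path_for(url, save_dir)
        if resume_download and os.path.exists(path):
            continue  # already on disk, skip
        to_fetch[url] = path
    return to_fetch
```

With resume_download left at its default of False, every URL stays in the plan and existing files would be fetched again.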
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with downloaded file paths or content stored in specified fields |
Usage Examples
YAML Configuration
process:
  - download_file_mapper:
      download_field: image_url
      save_dir: ./downloaded_images
      resume_download: true
      timeout: 30
      max_concurrent: 10