Implementation:Datajuicer Data juicer DownloadFileMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for downloading files from URLs and either saving them to local disk or loading them into memory.

Description

DownloadFileMapper is a mapper operator that downloads files from the URLs stored in a dataset field, then either saves them to a specified directory (save_dir) or stores the raw bytes in a sample field (save_field, which falls back to image_bytes when neither save_dir nor save_field is given). Downloads run asynchronously through aiohttp: max_concurrent caps the number of simultaneous downloads and timeout sets the time limit, in seconds, for each one. Nested lists of URLs are flattened for batch downloading and the original structure is reconstructed afterward. With resume_download enabled, files that have already been downloaded are skipped. The operator works in batched mode and extends the Mapper base class.
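
The flatten-then-reconstruct step described above can be illustrated with a minimal sketch. It is not taken from the Data-Juicer source: the helper names flatten_urls and rebuild are hypothetical, and the sketch assumes each batch entry is either a single URL string or a flat list of URLs.

```python
from typing import Any, List, Optional, Tuple


def flatten_urls(batch: List[Any]) -> Tuple[List[str], List[Optional[int]]]:
    """Flatten a batch whose entries are a URL string or a list of URLs.

    Returns the flat URL list plus a per-entry layout: None for a bare
    string, or the list length for a list entry.
    """
    flat: List[str] = []
    layout: List[Optional[int]] = []
    for entry in batch:
        if isinstance(entry, list):
            flat.extend(entry)
            layout.append(len(entry))
        else:
            flat.append(entry)
            layout.append(None)
    return flat, layout


def rebuild(results: List[Any], layout: List[Optional[int]]) -> List[Any]:
    """Regroup flat results so they mirror the original batch structure."""
    out: List[Any] = []
    pos = 0
    for size in layout:
        if size is None:
            # The entry was a bare string: take exactly one result.
            out.append(results[pos])
            pos += 1
        else:
            # The entry was a list: take a slice of the recorded length.
            out.append(results[pos:pos + size])
            pos += size
    return out
```

Keeping a layout list beside the flat list lets every URL go through one download pass while each result is still returned to the sample it came from.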

Usage

Import when you need to download images, audio, or other media files from URLs before processing them with downstream operators.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/download_file_mapper.py

Signature

@OPERATORS.register_module("download_file_mapper")
class DownloadFileMapper(Mapper):
    def __init__(self,
                 download_field: str = None,
                 save_dir: str = None,
                 save_field: str = None,
                 resume_download: bool = False,
                 timeout: int = 30,
                 max_concurrent: int = 10,
                 *args, **kwargs):

Import

from data_juicer.ops.mapper.download_file_mapper import DownloadFileMapper

I/O Contract

Inputs

Name	Type	Required	Description
download_field	str	No	Field name containing URLs to download. Default: None
save_dir	str	No	Directory to save downloaded files. Default: None
save_field	str	No	Field name to save downloaded file content. Default: None (falls back to image_bytes if save_dir is also None)
resume_download	bool	No	Whether to resume download and skip already-existing files. Default: False
timeout	int	No	Timeout in seconds for each download. Default: 30
max_concurrent	int	No	Maximum number of concurrent downloads. Default: 10
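
The roles of timeout and max_concurrent can be pictured with the sketch below. It uses only asyncio and a stand-in coroutine (fake_fetch) instead of real aiohttp requests, so it shows the general bounded-concurrency pattern rather than the operator's actual code. The function names and the choice to return None on timeout are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[bytes]],
    timeout: float = 30,
    max_concurrent: int = 10,
) -> List[Optional[bytes]]:
    """Run `fetch` over all URLs with at most `max_concurrent` in flight.

    A download that exceeds `timeout` seconds yields None (an assumed
    failure convention for this sketch).
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> Optional[bytes]:
        # The timeout clock starts only after a semaphore slot is acquired.
        async with semaphore:
            try:
                return await asyncio.wait_for(fetch(url), timeout=timeout)
            except asyncio.TimeoutError:
                return None

    # gather preserves input order, so results line up with `urls`.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP request; sleeps longer for 'slow' URLs."""
    await asyncio.sleep(0.2 if "slow" in url else 0.01)
    return url.encode()


results = asyncio.run(
    fetch_all(["u1", "slow", "u2"], fake_fetch, timeout=0.1, max_concurrent=2)
)
```

Here the "slow" URL exceeds the 0.1 second limit and produces None, while the other two complete normally.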

Outputs

Name	Type	Description
samples	Dict	Transformed samples with downloaded file paths or content stored in specified fields

Usage Examples

YAML Configuration

process:
  - download_file_mapper:
      download_field: image_url
      save_dir: ./downloaded_images
      resume_download: true
      timeout: 30
      max_concurrent: 10
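
One plausible reading of resume_download: true in a configuration like the one above is sketched below: a URL is skipped when its target file already exists in save_dir. The file-naming rule (last path segment of the URL) and the helper names target_path and needs_download are hypothetical and may differ from what the operator actually does.

```python
import os
import tempfile
from urllib.parse import urlparse


def target_path(url: str, save_dir: str) -> str:
    """Hypothetical naming rule: use the last path segment of the URL."""
    name = os.path.basename(urlparse(url).path) or "download"
    return os.path.join(save_dir, name)


def needs_download(url: str, save_dir: str, resume_download: bool) -> bool:
    """Return False only when resuming and the target file already exists."""
    return not (resume_download and os.path.exists(target_path(url, save_dir)))


with tempfile.TemporaryDirectory() as save_dir:
    # Simulate a file left behind by an earlier run.
    open(os.path.join(save_dir, "cat.jpg"), "wb").close()
    cached = needs_download("https://example.com/img/cat.jpg", save_dir, True)
    missing = needs_download("https://example.com/img/dog.jpg", save_dir, True)
    forced = needs_download("https://example.com/img/cat.jpg", save_dir, False)
```

With resume enabled, only dog.jpg would be fetched; with resume disabled, cat.jpg would be downloaded again even though a copy exists.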

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
