Implementation:Datajuicer Data juicer DownloadFileMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for downloading files from URLs to local disk or loading their contents into memory.

Description

DownloadFileMapper is a mapper operator that downloads files from the URLs stored in a dataset field and either saves them to local disk or loads their contents into memory. Downloads run asynchronously through aiohttp, with concurrency capped by max_concurrent and a configurable per-download timeout. Nested lists of URLs are flattened for batch downloading, and the original structure is reconstructed afterward. A resume-download mode skips files that have already been downloaded. Results are either written to a specified directory or stored as raw bytes in a sample field (image_bytes by default). The operator runs in batched mode and extends the Mapper base class.
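
The flatten-then-reconstruct step described above can be illustrated with a minimal sketch. This is not the operator's actual code: the helper names flatten_urls and restore_structure, the shape record, and the example URLs are all hypothetical, and the download itself is replaced by a stand-in that keeps only the file name.

```python
from typing import Any, List, Tuple


def flatten_urls(batch: List[Any]) -> Tuple[List[str], List[Any]]:
    """Flatten a batch whose entries are either one URL or a list of URLs.

    Returns the flat URL list plus a per-sample shape record:
    None for a single URL, or the list length for a list of URLs.
    """
    flat: List[str] = []
    shapes: List[Any] = []
    for entry in batch:
        if isinstance(entry, list):
            flat.extend(entry)
            shapes.append(len(entry))
        else:
            flat.append(entry)
            shapes.append(None)
    return flat, shapes


def restore_structure(flat_results: List[Any], shapes: List[Any]) -> List[Any]:
    """Rebuild the per-sample nesting from flat results and the shape record."""
    restored: List[Any] = []
    pos = 0
    for shape in shapes:
        if shape is None:
            restored.append(flat_results[pos])
            pos += 1
        else:
            restored.append(flat_results[pos:pos + shape])
            pos += shape
    return restored


# Hypothetical batch: one single URL, one list of URLs, one empty list.
batch = [
    "http://example.com/a.jpg",
    ["http://example.com/b.jpg", "http://example.com/c.jpg"],
    [],
]
flat, shapes = flatten_urls(batch)

# Stand-in for the real download: keep only the file name of each URL.
results = [url.rsplit("/", 1)[-1] for url in flat]
nested = restore_structure(results, shapes)
print(nested)
```

Recording only the shape of each sample keeps the download step independent of the nesting, so one flat list can be handed to the concurrent downloader in a single call.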

Usage

Import this operator when you need to download images, audio, or other media files from URLs before processing them with downstream operators.

Code Reference

Source Location

Signature

@OPERATORS.register_module("download_file_mapper")
class DownloadFileMapper(Mapper):
    def __init__(self,
                 download_field: str = None,
                 save_dir: str = None,
                 save_field: str = None,
                 resume_download: bool = False,
                 timeout: int = 30,
                 max_concurrent: int = 10,
                 *args, **kwargs):

Import

from data_juicer.ops.mapper.download_file_mapper import DownloadFileMapper

I/O Contract

Inputs

Name | Type | Required | Description
download_field | str | No | Field name containing URLs to download. Default: None
save_dir | str | No | Directory to save downloaded files. Default: None
save_field | str | No | Field name to save downloaded file content. Default: None (falls back to image_bytes if save_dir is also None)
resume_download | bool | No | Whether to resume download and skip already-existing files. Default: False
timeout | int | No | Timeout in seconds for each download. Default: 30
max_concurrent | int | No | Maximum number of concurrent downloads. Default: 10
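
To show how max_concurrent and timeout might interact, here is a small sketch using only asyncio. It assumes a semaphore caps the number of in-flight downloads and that each download gets its own timeout; whether the operator is built exactly this way is an assumption. The aiohttp request is replaced by a hypothetical fake_fetch coroutine so the example runs without network access.

```python
import asyncio
from typing import List, Optional


async def fake_fetch(url: str, delay: float) -> bytes:
    """Hypothetical stand-in for an HTTP GET: waits, then returns bytes."""
    await asyncio.sleep(delay)
    return url.encode("utf-8")


async def download_all(
    urls: List[str],
    delays: List[float],
    max_concurrent: int = 10,
    timeout: float = 30.0,
) -> List[Optional[bytes]]:
    # At most max_concurrent downloads hold the semaphore at any moment.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str, delay: float) -> Optional[bytes]:
        async with semaphore:
            try:
                # The timeout applies to each download separately.
                return await asyncio.wait_for(fake_fetch(url, delay), timeout=timeout)
            except asyncio.TimeoutError:
                # A failed download is recorded as None in this sketch.
                return None

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(bounded(u, d) for u, d in zip(urls, delays)))


urls = ["u1", "u2", "u3"]
results = asyncio.run(
    download_all(urls, delays=[0.01, 0.5, 0.01], max_concurrent=2, timeout=0.1)
)
print(results)
```

In this run the second download exceeds the timeout and yields None, while the other two complete. Because results keep the input order, they can be mapped back onto the samples they came from.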

Outputs

Name | Type | Description
samples | Dict | Transformed samples with downloaded file paths or content stored in specified fields

Usage Examples

YAML Configuration

process:
  - download_file_mapper:
      download_field: image_url
      save_dir: ./downloaded_images
      resume_download: true
      timeout: 30
      max_concurrent: 10
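
The effect of resume_download: true combined with save_dir can be sketched as follows. This is an illustration under stated assumptions, not the operator's implementation: the function download_with_resume, the rule of naming each file after the last URL path segment, and the fetch callable are all hypothetical.

```python
import os
import tempfile
from typing import Callable, List, Tuple


def download_with_resume(
    urls: List[str],
    save_dir: str,
    fetch: Callable[[str], bytes],
    resume_download: bool = False,
) -> Tuple[List[str], List[str]]:
    """Save each URL under save_dir; return (local paths, URLs actually fetched)."""
    os.makedirs(save_dir, exist_ok=True)
    paths: List[str] = []
    fetched: List[str] = []
    for url in urls:
        # Assumed naming rule: use the last path segment of the URL.
        path = os.path.join(save_dir, os.path.basename(url))
        if resume_download and os.path.exists(path):
            # Resume mode: the file is already on disk, so skip the download.
            paths.append(path)
            continue
        with open(path, "wb") as f:
            f.write(fetch(url))
        fetched.append(url)
        paths.append(path)
    return paths, fetched


def fake_fetch(url: str) -> bytes:
    """Hypothetical stand-in for the real HTTP download."""
    return url.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    urls = ["http://example.com/a.jpg", "http://example.com/b.jpg"]
    _, first_run = download_with_resume(urls, tmp, fake_fetch, resume_download=True)
    _, second_run = download_with_resume(urls, tmp, fake_fetch, resume_download=True)

print(first_run)
print(second_run)
```

The first pass fetches both files; the second pass finds them on disk and fetches nothing, which is the behavior that makes an interrupted job cheap to restart.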
