Implementation:Datajuicer Data juicer Downloader

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Acquisition
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for downloading files from remote data sources, iterating over their records, and extracting text.

Description

DocumentDownloader, DocumentIterator, and DocumentExtractor are abstract base classes that define the download-iterate-extract pattern for acquiring raw training data. The download_and_extract function coordinates these three components: for each URL, it downloads the file, iterates over its records, extracts text from each record, collects the results into a pandas DataFrame, and writes the partition to disk. It then returns a HuggingFace Dataset. Three helper functions support this workflow: get_wikipedia_urls scrapes the Wikimedia dump index to retrieve all dump URLs, get_arxiv_urls uses s5cmd to list the contents of the arxiv S3 bucket, and validate_snapshot_format validates snapshot format strings.
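
The coordination described above can be pictured with a simplified, standard-library-only sketch. It is not the Data-Juicer implementation: run_pipeline, FakeDownloader, LineIterator, and StripExtractor are hypothetical names, the result is a plain list of dicts rather than a HuggingFace Dataset, and the length check, the skipping of None results, and the item_limit handling are assumptions about reasonable behavior.

```python
import json
import os
import tempfile


def run_pipeline(urls, output_paths, downloader, iterator, extractor, item_limit=None):
    """Simplified stand-in for the download-iterate-extract loop (not library code)."""
    # Assumption: one output path per URL.
    if len(urls) != len(output_paths):
        raise ValueError("urls and output_paths must have the same length")
    all_records = []
    for url, out_path in zip(urls, output_paths):
        local_file = downloader.download(url)  # 1. download
        partition = []
        for i, raw in enumerate(iterator.iterate(local_file)):  # 2. iterate
            if item_limit is not None and i >= item_limit:
                break
            record = extractor.extract(raw)  # 3. extract
            if record is not None:  # assumption: None means "skip this record"
                partition.append(record)
        with open(out_path, "w", encoding="utf-8") as f:  # 4. write the partition
            for record in partition:
                f.write(json.dumps(record) + "\n")
        all_records.extend(partition)
    return all_records


class FakeDownloader:
    """Toy downloader: writes fixed content instead of touching the network."""

    def __init__(self, workdir):
        self.workdir = workdir

    def download(self, url):
        path = os.path.join(self.workdir, url.rsplit("/", 1)[-1])
        with open(path, "w", encoding="utf-8") as f:
            f.write("first line\n\nsecond line\n")
        return path


class LineIterator:
    """Toy iterator: treats every line of the file as one raw record."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line


class StripExtractor:
    """Toy extractor: strips whitespace and drops empty records."""

    def extract(self, content):
        text = content.strip()
        return {"text": text} if text else None


workdir = tempfile.mkdtemp()
urls = ["https://example.com/a.txt", "https://example.com/b.txt"]
output_paths = [os.path.join(workdir, "a.jsonl"), os.path.join(workdir, "b.jsonl")]
records = run_pipeline(
    urls, output_paths, FakeDownloader(workdir), LineIterator(), StripExtractor()
)
print(len(records), "records written across", len(output_paths), "partitions")
```

Keeping the three components behind small, single-method interfaces means the loop itself never needs to know whether the source is a Wikipedia dump, an arxiv archive, or something else.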

Usage

Use when implementing source-specific data downloaders (e.g. Wikipedia, arxiv) that follow the consistent download-iterate-extract pattern for acquiring raw training data from remote sources.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/download/downloader.py

Signature

class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url):

class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path):

class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content):

def download_and_extract(
    urls: List[str],
    output_paths: List[str],
    downloader: DocumentDownloader,
    iterator: DocumentIterator,
    extractor: DocumentExtractor,
    output_format: dict,
    output_type: str = "jsonl",
    keep_raw_download=False,
    force_download=False,
    input_meta: Union[str, dict] = None,
    item_limit=None,
) -> Dataset:

def get_wikipedia_urls(
    language="en",
    wikidumps_index_prefix="https://dumps.wikimedia.org",
    dump_date: Optional[str] = None,
) -> List[str]:

def get_arxiv_urls():

def validate_snapshot_format(snapshot: Optional[str]) -> None:

Import

from data_juicer.download.downloader import (
    DocumentDownloader, DocumentIterator, DocumentExtractor,
    download_and_extract, get_wikipedia_urls, get_arxiv_urls
)

I/O Contract

Inputs

Name	Type	Required	Description
urls	List[str]	Yes	List of URLs to download data from
output_paths	List[str]	Yes	List of paths to save the final extracted output
downloader	DocumentDownloader	Yes	A downloader that retrieves files from URLs
iterator	DocumentIterator	Yes	An iterator that reads records from downloaded files
extractor	DocumentExtractor	Yes	An extractor that converts raw records to text
output_format	dict	Yes	Dictionary mapping column names to data types
output_type	str	No	File type to save the dataset as. Default: "jsonl"
keep_raw_download	bool	No	Whether to keep the pre-extracted download file. Default: False
force_download	bool	No	Whether to re-download existing files. Default: False
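input_meta	Union[str, dict]	No	Metadata describing the input, given as a dict or a string (see signature). Default: None
item_limit	int	No	Limit on number of items downloaded per URL. Default: None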

Outputs

Name	Type	Description
dataset	Dataset	A HuggingFace Dataset containing all downloaded and extracted data
urls	List[str]	List of Wikipedia dump or arxiv source URLs (from helper functions)

Usage Examples

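from data_juicer.download.downloader import (
    download_and_extract, get_wikipedia_urls
)

# Get Wikipedia dump URLs for a specific dump date
wiki_urls = get_wikipedia_urls(language="en", dump_date="20240101")

# Process only the first few URLs and build one output path per selected URL,
# so both lists stay the same length even if fewer than 5 URLs are returned.
selected_urls = wiki_urls[:5]
output_paths = ["output/wiki_{}.jsonl".format(i) for i in range(len(selected_urls))]

# my_downloader, my_iterator, and my_extractor are placeholders for
# user-defined instances of DocumentDownloader, DocumentIterator, and
# DocumentExtractor subclasses.
dataset = download_and_extract(
    urls=selected_urls,
    output_paths=output_paths,
    downloader=my_downloader,
    iterator=my_iterator,
    extractor=my_extractor,
    output_format={"text": str, "title": str},
    output_type="jsonl",
)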
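
The my_downloader, my_iterator, and my_extractor arguments are left undefined in the example. A minimal sketch of what such custom implementations might look like follows. The three base classes are local stand-ins that mirror the signatures listed on this page so the sketch runs without Data-Juicer installed; in real code they would be imported from data_juicer.download.downloader. LocalCopyDownloader, JsonlIterator, and TitleTextExtractor are hypothetical names, and the return shapes of iterate and extract are assumptions, not the documented contract.

```python
import json
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


# Stand-ins mirroring the signatures on this page (assumption: real code would
# import these from data_juicer.download.downloader instead).
class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url): ...


class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path): ...


class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content): ...


class LocalCopyDownloader(DocumentDownloader):
    """Hypothetical downloader: treats the 'url' as a local path and copies it."""

    def __init__(self, download_dir):
        self.download_dir = download_dir

    def download(self, url):
        os.makedirs(self.download_dir, exist_ok=True)
        target = os.path.join(self.download_dir, os.path.basename(url))
        if not os.path.exists(target):  # skip files that are already present
            shutil.copyfile(url, target)
        return target


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator: yields one parsed JSON object per non-empty line."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


class TitleTextExtractor(DocumentExtractor):
    """Hypothetical extractor: returns 'text' and 'title', or None for empty bodies."""

    def extract(self, content):
        text = content.get("body", "").strip()
        if not text:
            return None
        return {"text": text, "title": content.get("title", "")}


# Build a small local source file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.jsonl")
with open(source, "w", encoding="utf-8") as f:
    f.write(json.dumps({"title": "A", "body": " alpha "}) + "\n")
    f.write(json.dumps({"title": "B", "body": ""}) + "\n")

my_downloader = LocalCopyDownloader(os.path.join(workdir, "downloads"))
my_iterator = JsonlIterator()
my_extractor = TitleTextExtractor()

# Chain the three components by hand for a single source.
path = my_downloader.download(source)
rows = [my_extractor.extract(raw) for raw in my_iterator.iterate(path)]
rows = [row for row in rows if row is not None]
print(rows)
```

The keys returned by the extractor are chosen to match the output_format={"text": str, "title": str} mapping used in the example.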

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
