Implementation:Datajuicer Data juicer Downloader

From Leeroopedia
Knowledge Sources
Domains Data_Acquisition
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for downloading files from remote data sources, iterating over their records, and extracting text from them.

Description

DocumentDownloader, DocumentIterator, and DocumentExtractor are abstract base classes that define the download-iterate-extract pattern for acquiring raw training data. The download_and_extract function coordinates these three components: for each URL, it downloads the file, iterates over its records, extracts text from each record, collects the results into a pandas DataFrame, and writes partitions to disk. It returns a HuggingFace Dataset. Three helper functions accompany them: get_wikipedia_urls scrapes the Wikimedia dump index to retrieve the dump URLs, get_arxiv_urls uses s5cmd to list the contents of the arXiv S3 bucket, and validate_snapshot_format validates snapshot format strings.
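The download-iterate-extract flow described above can be illustrated with a minimal, network-free sketch. The base classes below are local stand-ins that only mirror the method names listed on this page; FakeDownloader, LineIterator, TabExtractor, and run_pipeline are hypothetical names, and the return shapes (a file path, raw string records, a dict per record) are assumptions rather than the documented contract. The sketch collects rows in a plain list instead of a pandas DataFrame or a HuggingFace Dataset.

```python
from abc import ABC, abstractmethod
import os
import tempfile


# Local stand-ins mirroring the method names on this page; the real classes
# live in data_juicer.download.downloader. Return shapes are assumptions.
class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url):
        """Fetch `url` and return a local file path."""


class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path):
        """Yield raw records from the downloaded file."""


class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content):
        """Turn one raw record into a dict of output fields."""


class FakeDownloader(DocumentDownloader):
    """Hypothetical downloader: writes canned text instead of using the network."""

    def __init__(self, download_dir):
        self.download_dir = download_dir

    def download(self, url):
        path = os.path.join(self.download_dir, os.path.basename(url))
        with open(path, "w", encoding="utf-8") as f:
            f.write("Title A\tfirst body\nTitle B\tsecond body\n")
        return path


class LineIterator(DocumentIterator):
    """Hypothetical iterator: one record per line."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


class TabExtractor(DocumentExtractor):
    """Hypothetical extractor: splits a record into title and text."""

    def extract(self, content):
        title, text = content.split("\t", 1)
        return {"title": title, "text": text}


def run_pipeline(urls, downloader, iterator, extractor):
    """Download each URL, iterate its records, extract fields, collect rows."""
    rows = []
    for url in urls:
        file_path = downloader.download(url)
        for record in iterator.iterate(file_path):
            rows.append(extractor.extract(record))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    rows = run_pipeline(
        ["https://example.com/dump-1.txt", "https://example.com/dump-2.txt"],
        FakeDownloader(tmp), LineIterator(), TabExtractor(),
    )
print(len(rows), rows[0])
```

Keeping the three roles separate means a new source usually needs only a new downloader or iterator, while the coordinating loop stays unchanged.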

Usage

Use this module when implementing source-specific data downloaders (e.g. Wikipedia, arXiv) that follow the download-iterate-extract pattern for acquiring raw training data from remote sources.

Code Reference

Source Location

Signature

class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url):

class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path):

class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content):

def download_and_extract(
    urls: List[str],
    output_paths: List[str],
    downloader: DocumentDownloader,
    iterator: DocumentIterator,
    extractor: DocumentExtractor,
    output_format: dict,
    output_type: str = "jsonl",
    keep_raw_download=False,
    force_download=False,
    input_meta: Union[str, dict] = None,
    item_limit=None,
) -> Dataset:

def get_wikipedia_urls(
    language="en",
    wikidumps_index_prefix="https://dumps.wikimedia.org",
    dump_date: Optional[str] = None,
) -> List[str]:

def get_arxiv_urls():

def validate_snapshot_format(snapshot: Optional[str]) -> None:

Import

from data_juicer.download.downloader import (
    DocumentDownloader, DocumentIterator, DocumentExtractor,
    download_and_extract, get_wikipedia_urls, get_arxiv_urls
)

I/O Contract

Inputs

Name               Type                Required  Description
urls               List[str]           Yes       List of URLs to download data from
output_paths       List[str]           Yes       List of paths to save the final extracted output
downloader         DocumentDownloader  Yes       A downloader that retrieves files from URLs
iterator           DocumentIterator    Yes       An iterator that reads records from downloaded files
extractor          DocumentExtractor   Yes       An extractor that converts raw records to text
output_format      dict                Yes       Dictionary mapping column names to data types
output_type        str                 No        File type to save the dataset as. Default: "jsonl"
keep_raw_download  bool                No        Whether to keep the pre-extracted download file. Default: False
force_download     bool                No        Whether to re-download existing files. Default: False
input_meta         Union[str, dict]    No        Optional input metadata, passed as a string or a dict. Default: None
item_limit         int                 No        Limit on the number of items downloaded per URL. Default: None
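The force_download and item_limit options can be read as two small behaviors: skip a download when the file already exists unless a re-download is forced, and cap the number of records taken from each source. The sketch below shows one way such behaviors could be written. It is an assumption about the semantics, not Data-Juicer's implementation; fetch_once, limited, and fake_fetch are hypothetical names.

```python
import os
import tempfile
from itertools import islice


def fetch_once(url, target_path, fetch, force_download=False):
    """Call `fetch` only if the file is missing or a re-download is forced."""
    if os.path.exists(target_path) and not force_download:
        return False  # reuse the existing file
    fetch(url, target_path)
    return True


def limited(records, item_limit=None):
    """Yield at most `item_limit` records; None means no cap."""
    return islice(records, item_limit)


calls = []


def fake_fetch(url, target_path):
    # Hypothetical stand-in for a real network download.
    calls.append(url)
    with open(target_path, "w", encoding="utf-8") as f:
        f.write("payload")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "dump.bin")
    url = "https://example.com/dump.bin"
    first = fetch_once(url, target, fake_fetch)
    second = fetch_once(url, target, fake_fetch)
    forced = fetch_once(url, target, fake_fetch, force_download=True)

first_three = list(limited(iter(range(10)), item_limit=3))
print(first, second, forced, len(calls), first_three)
```

Using itertools.islice keeps the cap lazy, so records beyond the limit are never read from the underlying iterator.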

Outputs

Name     Type       Description
dataset  Dataset    A HuggingFace Dataset containing all downloaded and extracted data
urls     List[str]  List of Wikipedia dump or arxiv source URLs (from helper functions)
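To make the shape of the returned Wikipedia URLs concrete, the following sketch builds URLs according to the public dumps.wikimedia.org directory layout (one directory per wiki and dump date, each containing a dumpstatus.json file). It is not taken from Data-Juicer's source: dump_status_url and xml_file_urls are hypothetical helpers, the sample dictionary is hand-written, and its structure is an assumption about what dumpstatus.json contains. No network request is made.

```python
from urllib.parse import urljoin


def dump_dir_url(language="en", dump_date="20240101",
                 index_prefix="https://dumps.wikimedia.org"):
    """Build the directory URL for one wiki and dump date."""
    base = index_prefix.rstrip("/") + "/"
    return urljoin(base, f"{language}wiki/{dump_date}/")


def dump_status_url(language="en", dump_date="20240101",
                    index_prefix="https://dumps.wikimedia.org"):
    """Build the dumpstatus.json URL inside that directory."""
    return urljoin(dump_dir_url(language, dump_date, index_prefix),
                   "dumpstatus.json")


def xml_file_urls(directory_url, dump_status, job="articlesmultistreamdump"):
    """List the XML files reported for one dump job as absolute URLs."""
    files = dump_status["jobs"][job]["files"]
    return [urljoin(directory_url, name) for name in files if "xml" in name]


# Hand-written sample shaped like a dumpstatus.json response (assumed layout).
sample_status = {
    "jobs": {
        "articlesmultistreamdump": {
            "files": {
                "enwiki-20240101-pages-articles-multistream.xml.bz2": {},
                "enwiki-20240101-pages-articles-multistream-index.txt.bz2": {},
            }
        }
    }
}

directory = dump_dir_url()
status_url = dump_status_url()
wiki_urls = xml_file_urls(directory, sample_status)
print(status_url)
print(wiki_urls)
```

Each resulting URL points at one compressed dump file, which is the kind of entry a downloader in the download-iterate-extract pattern would receive.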

Usage Examples

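from data_juicer.download.downloader import (
    download_and_extract, get_wikipedia_urls
)

# Get Wikipedia dump URLs for the English dump dated 20240101
wiki_urls = get_wikipedia_urls(language="en", dump_date="20240101")

# Keep at most five URLs and build one output path per selected URL,
# so the two lists stay the same length even if fewer URLs are returned
urls = wiki_urls[:5]
output_paths = ["output/wiki_{}.jsonl".format(i) for i in range(len(urls))]

# my_downloader, my_iterator, and my_extractor are placeholders for
# user-supplied instances of DocumentDownloader, DocumentIterator, and
# DocumentExtractor subclasses; they are not defined in this snippet
dataset = download_and_extract(
    urls=urls,
    output_paths=output_paths,
    downloader=my_downloader,
    iterator=my_iterator,
    extractor=my_extractor,
    output_format={"text": str, "title": str},
    output_type="jsonl",
)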
