Implementation:Datajuicer Data juicer Downloader
| Knowledge Sources | |
|---|---|
| Domains | Data_Acquisition |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool provided by Data-Juicer for downloading files from remote data sources, iterating over their records, and extracting text.
Description
DocumentDownloader, DocumentIterator, and DocumentExtractor are abstract base classes that define the download-iterate-extract pattern for acquiring raw training data. The download_and_extract function coordinates these three components: for each URL, it downloads the file, iterates over its records, extracts text from each record, collects the results into a pandas DataFrame, and writes the partition to disk. It finally returns a HuggingFace Dataset. Three helper functions support this workflow: get_wikipedia_urls scrapes Wikimedia dump indexes to retrieve all dump URLs, get_arxiv_urls uses s5cmd to list the contents of the arxiv S3 bucket, and validate_snapshot_format validates snapshot format strings.
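The download-iterate-extract flow described above can be sketched without Data-Juicer installed. The following is a minimal, self-contained illustration, not the library's implementation. The three base classes are local stand-ins that mirror the documented method names, and `InMemoryDownloader`, `LineIterator`, `TitleTextExtractor`, `run_pipeline`, and the example.com URLs are hypothetical names invented for the example. It collects records in a plain list rather than a pandas DataFrame and does not write partitions to disk.

```python
import os
import tempfile
from abc import ABC, abstractmethod


# Local stand-ins mirroring the documented abstract interfaces (assumption:
# the real classes live in data_juicer.download.downloader).
class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url):
        """Fetch `url` and return a local file path."""


class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path):
        """Yield raw records from a downloaded file."""


class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content):
        """Turn one raw record into a dict of output fields."""


class InMemoryDownloader(DocumentDownloader):
    """Hypothetical downloader: copies from a dict instead of the network."""

    def __init__(self, fake_remote, download_dir):
        self.fake_remote = fake_remote
        self.download_dir = download_dir

    def download(self, url):
        path = os.path.join(self.download_dir, os.path.basename(url))
        with open(path, "w", encoding="utf-8") as f:
            f.write(self.fake_remote[url])
        return path


class LineIterator(DocumentIterator):
    """Hypothetical iterator: one record per non-empty line."""

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line.rstrip("\n")


class TitleTextExtractor(DocumentExtractor):
    """Hypothetical extractor: splits 'title<TAB>text' records."""

    def extract(self, content):
        title, _, text = content.partition("\t")
        return {"title": title, "text": text}


def run_pipeline(urls, downloader, iterator, extractor):
    """Download each URL, iterate its records, and extract fields from each."""
    records = []
    for url in urls:
        file_path = downloader.download(url)
        for raw in iterator.iterate(file_path):
            records.append(extractor.extract(raw))
    return records


# Made-up "remote" content keyed by made-up URLs.
fake_remote = {
    "https://example.com/a.txt": "Alpha\tfirst doc\nBeta\tsecond doc\n",
    "https://example.com/b.txt": "Gamma\tthird doc\n",
}
with tempfile.TemporaryDirectory() as tmp:
    records = run_pipeline(
        list(fake_remote),
        InMemoryDownloader(fake_remote, tmp),
        LineIterator(),
        TitleTextExtractor(),
    )
print(records)
```

Keeping the three roles separate means a new source usually needs only new subclasses, while the coordinating loop stays unchanged.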
Usage
Use when implementing source-specific data downloaders (e.g. Wikipedia, arxiv) that follow the consistent download-iterate-extract pattern for acquiring raw training data from remote sources.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/download/downloader.py
Signature
class DocumentDownloader(ABC):
    @abstractmethod
    def download(self, url):

class DocumentIterator(ABC):
    @abstractmethod
    def iterate(self, file_path):

class DocumentExtractor(ABC):
    @abstractmethod
    def extract(self, content):

def download_and_extract(
    urls: List[str],
    output_paths: List[str],
    downloader: DocumentDownloader,
    iterator: DocumentIterator,
    extractor: DocumentExtractor,
    output_format: dict,
    output_type: str = "jsonl",
    keep_raw_download=False,
    force_download=False,
    input_meta: Union[str, dict] = None,
    item_limit=None,
) -> Dataset:

def get_wikipedia_urls(
    language="en",
    wikidumps_index_prefix="https://dumps.wikimedia.org",
    dump_date: Optional[str] = None,
) -> List[str]:

def get_arxiv_urls():

def validate_snapshot_format(snapshot: Optional[str]) -> None:
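To illustrate what scraping a dump index for URLs can involve, here is a hedged sketch that uses only the Python standard library. It is not the code behind get_wikipedia_urls: `LinkCollector`, `dump_urls_from_index`, the sample HTML, and the file names in it are all made up for the example, and the real index layout and parsing logic may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def dump_urls_from_index(index_html, base_url, suffix=".bz2"):
    """Return absolute URLs for links in `index_html` that end with `suffix`."""
    parser = LinkCollector()
    parser.feed(index_html)
    return [urljoin(base_url, h) for h in parser.hrefs if h.endswith(suffix)]


# Invented index page; a real dump index would be fetched over HTTP.
SAMPLE_INDEX = """
<html><body>
<a href="../">Parent</a>
<a href="enwiki-20240101-pages-articles1.xml.bz2">part 1</a>
<a href="enwiki-20240101-pages-articles2.xml.bz2">part 2</a>
<a href="enwiki-20240101-md5sums.txt">checksums</a>
</body></html>
"""

urls = dump_urls_from_index(
    SAMPLE_INDEX, "https://dumps.wikimedia.org/enwiki/20240101/"
)
for url in urls:
    print(url)
```

Filtering by suffix keeps the compressed dump parts and drops navigation links and checksum files.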
Import
from data_juicer.download.downloader import (
    DocumentDownloader, DocumentIterator, DocumentExtractor,
    download_and_extract, get_wikipedia_urls, get_arxiv_urls
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| urls | List[str] | Yes | List of URLs to download data from |
| output_paths | List[str] | Yes | List of paths to save the final extracted output |
| downloader | DocumentDownloader | Yes | A downloader that retrieves files from URLs |
| iterator | DocumentIterator | Yes | An iterator that reads records from downloaded files |
| extractor | DocumentExtractor | Yes | An extractor that converts raw records to text |
| output_format | dict | Yes | Dictionary mapping column names to data types |
| output_type | str | No | File type to save the dataset as. Default: "jsonl" |
| keep_raw_download | bool | No | Whether to keep the pre-extracted download file. Default: False |
| force_download | bool | No | Whether to re-download existing files. Default: False |
| input_meta | Union[str, dict] | No | Optional input metadata, given as a string or a dict. Default: None |
| item_limit | int | No | Maximum number of items to download per URL. Default: None (no limit) |
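The table does not spell out how the optional flags interact. One plausible reading, sketched below under explicit assumptions, is that item_limit caps the records taken from each URL and that an existing output partition is reused unless force_download is set. `write_partition` and `count_lines` are hypothetical helpers, not Data-Juicer functions, and the library's actual behavior may differ.

```python
import itertools
import json
import os
import tempfile


def write_partition(records, output_path, force_download=False, item_limit=None):
    """Write up to `item_limit` records as JSON Lines; return True if written.

    Assumption: an existing output file is left untouched unless
    `force_download` is True.
    """
    if os.path.exists(output_path) and not force_download:
        return False
    limited = itertools.islice(records, item_limit)  # None means no limit
    with open(output_path, "w", encoding="utf-8") as f:
        for record in limited:
            f.write(json.dumps(record) + "\n")
    return True


def count_lines(path):
    """Count the lines (one record each) in a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


records = [{"text": f"doc {i}", "title": f"t{i}"} for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part_0.jsonl")
    first = write_partition(records, path, item_limit=3)
    lines_after_first = count_lines(path)
    second = write_partition(records, path)  # skipped: file already exists
    lines_after_second = count_lines(path)
    third = write_partition(records, path, force_download=True)
    lines_after_third = count_lines(path)
print(first, lines_after_first, second, lines_after_second, third, lines_after_third)
```

Under these assumptions a small item_limit is a cheap way to smoke-test a new downloader before running it over a full dump.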
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A HuggingFace Dataset containing all downloaded and extracted data |
| urls | List[str] | List of Wikipedia dump or arxiv source URLs (from helper functions) |
Usage Examples
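from data_juicer.download.downloader import (
    download_and_extract, get_wikipedia_urls
)

# Get Wikipedia dump URLs
wiki_urls = get_wikipedia_urls(language="en", dump_date="20240101")

# Take at most five URLs and build one output path per URL, so both lists
# have the same length even if fewer than five URLs are returned
urls = wiki_urls[:5]
output_paths = ["output/wiki_{}.jsonl".format(i) for i in range(len(urls))]

# Download and extract with custom implementations.
# my_downloader, my_iterator, and my_extractor are user-defined instances of
# DocumentDownloader, DocumentIterator, and DocumentExtractor subclasses.
dataset = download_and_extract(
    urls=urls,
    output_paths=output_paths,
    downloader=my_downloader,
    iterator=my_iterator,
    extractor=my_extractor,
    output_format={"text": str, "title": str},
    output_type="jsonl",
)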