Implementation:NVIDIA NeMo Curator ArXiv Downloader
| Knowledge Sources | |
|---|---|
| Domains | Data Acquisition, ArXiv, S3 |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
ArxivDownloader is a concrete implementation of DocumentDownloader that downloads ArXiv source tar files from the public S3 bucket s3://arxiv/src/ using the s5cmd command-line tool.
Description
The ArxivDownloader class extends DocumentDownloader to provide ArXiv-specific download functionality. It uses s5cmd, a high-performance S3 transfer tool, with the --request-payer=requester flag to copy tar files from the ArXiv S3 bucket to a local directory. During initialization, it validates that s5cmd is installed on the system, raising a RuntimeError if it is not found.
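The initialization check described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `require_s5cmd` and the error wording are hypothetical, and the use of `shutil.which` is an assumption about how the lookup might be done.

```python
import shutil


def require_s5cmd(executable: str = "s5cmd") -> str:
    """Return the resolved path of `executable`, or raise RuntimeError.

    Hypothetical helper: the real ArxivDownloader may perform this
    check differently inside __init__.
    """
    resolved = shutil.which(executable)  # None when the tool is not on PATH
    if resolved is None:
        msg = f"{executable} is not installed or not on PATH"
        raise RuntimeError(msg)
    return resolved
```

Failing at construction time surfaces a missing tool before any download is attempted, rather than partway through a pipeline run.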
The downloader inherits atomic download behavior from the base class: files are first downloaded to a temporary .tmp path, then atomically renamed to the final location upon success. If a file already exists and is non-empty, the download is skipped.
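The skip-then-rename behavior can be illustrated with a small standalone function. This is a sketch under assumptions, not the DocumentDownloader source: `atomic_download` and its `fetch` callback are hypothetical names, and removing the leftover temporary file on failure is an assumption that the description above does not confirm.

```python
import os
from typing import Callable, Optional, Tuple

# Hypothetical callback type mirroring _download_to_path(url, path).
Fetch = Callable[[str, str], Tuple[bool, Optional[str]]]


def atomic_download(url: str, final_path: str, fetch: Fetch) -> Optional[str]:
    """Download `url` to `final_path` through a temporary .tmp file."""
    # Skip work when a non-empty file is already in place.
    if os.path.exists(final_path) and os.path.getsize(final_path) > 0:
        return final_path

    tmp_path = final_path + ".tmp"
    ok, _error = fetch(url, tmp_path)
    if not ok:
        # Assumption: clean up a partial file so it is never mistaken for output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return None

    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

Because the final path only appears after a successful rename, an interrupted run never leaves a truncated archive at that path.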
Usage
Use ArxivDownloader when building a data acquisition pipeline that needs to download raw ArXiv LaTeX source archives from Amazon S3. It is typically used as the downloader component within a DocumentDownloadExtractStage pipeline alongside ArxivUrlGenerator, ArxivIterator, and ArxivExtractor.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/arxiv/download.py
- Lines: 1-61
Signature
class ArxivDownloader(DocumentDownloader):
    """Downloads Arxiv data from s3://arxiv/src/"""

    def __init__(self, download_dir: str, verbose: bool = False):
        ...

    def _get_output_filename(self, url: str) -> str:
        ...

    def _download_to_path(self, url: str, path: str) -> tuple[bool, str | None]:
        ...
Import
from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| download_dir | str | Yes | Directory path where downloaded files will be stored |
| verbose | bool | No | If True, logs detailed download information (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | str or None | Path to the downloaded file on success, or None on failure (via inherited download() method) |
Key Methods
_get_output_filename
Returns the URL directly as the output filename, since ArXiv URLs are already tar file names (e.g., arXiv_src_2301_001.tar).
_download_to_path
Constructs the full S3 path by joining s3://arxiv/src with the URL, then invokes s5cmd as a subprocess to copy the file:
cmd = ["s5cmd", "--request-payer=requester", "cp", s3path, path]
Returns a tuple of (success: bool, error_message: str | None). Subprocess output is suppressed unless verbose mode is enabled.
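The flow described above can be sketched as two small functions. This is an illustration under assumptions rather than the actual method body: `build_s5cmd_command` and `run_download` are hypothetical names, the path is joined with an f-string, and the error message wording is invented for the example.

```python
import subprocess
from typing import Optional

S3_PREFIX = "s3://arxiv/src"  # bucket prefix named in the description above


def build_s5cmd_command(url: str, path: str) -> list[str]:
    """Build the s5cmd copy command for one ArXiv tar file."""
    s3path = f"{S3_PREFIX}/{url}"
    return ["s5cmd", "--request-payer=requester", "cp", s3path, path]


def run_download(cmd: list[str], verbose: bool = False) -> tuple[bool, Optional[str]]:
    """Run `cmd` and return (success, error_message)."""
    # Suppress subprocess output unless verbose mode is enabled;
    # passing None lets the child inherit the parent's streams.
    stream = None if verbose else subprocess.DEVNULL
    result = subprocess.run(cmd, stdout=stream, stderr=stream, check=False)
    if result.returncode != 0:
        return False, f"command exited with code {result.returncode}"
    return True, None
```

Passing the command as a list avoids shell interpretation of the URL or local path, and `check=False` keeps failure handling in the returned tuple instead of an exception.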
Usage Examples
Basic Usage
from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader
downloader = ArxivDownloader(download_dir="/data/arxiv/raw", verbose=True)
# Download a single tar file
result_path = downloader.download("arXiv_src_2301_001.tar")
if result_path:
    print(f"Downloaded to: {result_path}")
As Part of a Pipeline
from nemo_curator.stages.text.download.base.download import DocumentDownloadStage
from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader
downloader = ArxivDownloader(download_dir="/data/arxiv/raw")
download_stage = DocumentDownloadStage(downloader=downloader)
Dependencies
- s5cmd: Must be installed on the system. Available from https://github.com/peak/s5cmd.
- AWS credentials: Required for requester-pays access to the ArXiv S3 bucket.