Implementation:NVIDIA NeMo Curator ArXiv Downloader

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Acquisition, ArXiv, S3
Last Updated	2026-02-14 00:00 GMT

Overview

ArxivDownloader is a concrete implementation of DocumentDownloader that downloads ArXiv source tar files from the public S3 bucket s3://arxiv/src/ using the s5cmd command-line tool.

Description

The ArxivDownloader class extends DocumentDownloader to provide ArXiv-specific download functionality. It uses s5cmd, a high-performance S3 transfer tool, with the --request-payer=requester flag to copy tar files from the ArXiv S3 bucket to a local directory. During initialization, it validates that s5cmd is installed on the system, raising a RuntimeError if it is not found.
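The initialization check described above can be sketched as follows. This is a minimal illustration, not the NeMo Curator source: the helper name require_s5cmd is hypothetical, and using shutil.which to locate the binary is an assumption about how such a check might be written.

```python
import shutil


def require_s5cmd(executable: str = "s5cmd") -> str:
    """Return the resolved path of the executable, or raise RuntimeError.

    Hypothetical helper; the real class performs its check inside __init__.
    """
    resolved = shutil.which(executable)  # None when the binary is not on PATH
    if resolved is None:
        msg = f"{executable} is not installed or not on PATH"
        raise RuntimeError(msg)
    return resolved
```

Failing at construction time means a missing tool surfaces before any download is attempted, rather than as a subprocess error partway through a run.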

The downloader inherits atomic download behavior from the base class: files are first downloaded to a temporary .tmp path, then atomically renamed to the final location upon success. If a file already exists and is non-empty, the download is skipped.

Usage

Use ArxivDownloader when building a data acquisition pipeline that needs to download raw ArXiv LaTeX source archives from Amazon S3. It is typically used as the downloader component within a DocumentDownloadExtractStage pipeline alongside ArxivUrlGenerator, ArxivIterator, and ArxivExtractor.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/arxiv/download.py
Lines: 1-61

Signature

class ArxivDownloader(DocumentDownloader):
    """Downloads Arxiv data from s3://arxiv/src/"""

    def __init__(self, download_dir: str, verbose: bool = False):
        ...

    def _get_output_filename(self, url: str) -> str:
        ...

    def _download_to_path(self, url: str, path: str) -> tuple[bool, str | None]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader

I/O Contract

Inputs

Name	Type	Required	Description
download_dir	`str`	Yes	Directory path where downloaded files will be stored
verbose	`bool`	No	If True, logs detailed download information (default: False)

Outputs

Name	Type	Description
return value	`str | None`	Path to the downloaded file on success, or None on failure (via inherited `download()` method)

Key Methods

_get_output_filename

Returns the URL directly as the output filename, since ArXiv URLs are already tar file names (e.g., arXiv_src_2301_001.tar).

_download_to_path

Constructs the full S3 path by joining s3://arxiv/src with the URL, then invokes s5cmd as a subprocess to copy the file:

cmd = ["s5cmd", "--request-payer=requester", "cp", s3path, path]

Returns a tuple of (success: bool, error_message: str | None). Subprocess output is suppressed unless verbose mode is enabled.

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader

downloader = ArxivDownloader(download_dir="/data/arxiv/raw", verbose=True)

# Download a single tar file
result_path = downloader.download("arXiv_src_2301_001.tar")
if result_path:
    print(f"Downloaded to: {result_path}")

As Part of a Pipeline

from nemo_curator.stages.text.download.base.download import DocumentDownloadStage
from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader

downloader = ArxivDownloader(download_dir="/data/arxiv/raw")
download_stage = DocumentDownloadStage(downloader=downloader)

Dependencies

s5cmd: Must be installed on the system. Available from https://github.com/peak/s5cmd.
AWS credentials: Required because the ArXiv S3 bucket is requester-pays; data transfer is billed to the AWS account that makes the request.
