Implementation:NVIDIA NeMo Curator ArXiv Downloader

From Leeroopedia
Knowledge Sources
Domains: Data Acquisition, ArXiv, S3
Last Updated: 2026-02-14 00:00 GMT

Overview

ArxivDownloader is a concrete implementation of DocumentDownloader that downloads ArXiv source tar files from the public S3 bucket s3://arxiv/src/ using the s5cmd command-line tool.

Description

The ArxivDownloader class extends DocumentDownloader to provide ArXiv-specific download functionality. It uses s5cmd, a high-performance S3 transfer tool, with the --request-payer=requester flag to copy tar files from the ArXiv S3 bucket to a local directory. During initialization, it validates that s5cmd is installed on the system, raising a RuntimeError if it is not found.
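
The installation check described above can be sketched as follows. This is a minimal illustration, not the code in download.py: the helper name require_tool is hypothetical, and using shutil.which to locate the binary is an assumption about how such a check might be written.

```python
import shutil


def require_tool(name: str = "s5cmd") -> str:
    """Return the resolved path of `name`, or raise RuntimeError if it is not on PATH.

    Hypothetical helper: the real ArxivDownloader may perform this check differently.
    """
    path = shutil.which(name)
    if path is None:
        msg = f"{name} is not installed or not on PATH"
        raise RuntimeError(msg)
    return path
```

Failing at construction time means a missing binary surfaces before any download is attempted, rather than as a per-file subprocess error.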

The downloader inherits atomic download behavior from the base class: files are first downloaded to a temporary .tmp path, then atomically renamed to the final location upon success. If a file already exists and is non-empty, the download is skipped.
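
The skip-then-rename behavior can be illustrated with a short sketch. It is written as a standalone function under stated assumptions: atomic_download and its fetch callback are hypothetical names, and the base class in NeMo Curator may organize the same steps differently.

```python
from __future__ import annotations

import os
from typing import Callable


def atomic_download(
    url: str,
    download_dir: str,
    fetch: Callable[[str, str], tuple[bool, str | None]],
) -> str | None:
    """Download `url` into `download_dir` via a temporary file.

    `fetch(url, tmp_path)` stands in for `_download_to_path` and must
    return (success, error_message).
    """
    os.makedirs(download_dir, exist_ok=True)
    final_path = os.path.join(download_dir, url)
    tmp_path = final_path + ".tmp"

    # Skip the transfer if a non-empty file is already in place.
    if os.path.exists(final_path) and os.path.getsize(final_path) > 0:
        return final_path

    ok, _error = fetch(url, tmp_path)
    if not ok:
        return None

    # os.replace is atomic when source and destination share a filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

Because the final path only appears after a successful rename, an interrupted transfer leaves at most a .tmp file behind and is never mistaken for a completed download.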

Usage

Use ArxivDownloader when building a data acquisition pipeline that needs to download raw ArXiv LaTeX source archives from Amazon S3. It is typically used as the downloader component within a DocumentDownloadExtractStage pipeline alongside ArxivUrlGenerator, ArxivIterator, and ArxivExtractor.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/arxiv/download.py
  • Lines: 1-61

Signature

class ArxivDownloader(DocumentDownloader):
    """Downloads Arxiv data from s3://arxiv/src/"""

    def __init__(self, download_dir: str, verbose: bool = False):
        ...

    def _get_output_filename(self, url: str) -> str:
        ...

    def _download_to_path(self, url: str, path: str) -> tuple[bool, str | None]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader

I/O Contract

Inputs

Name | Type | Required | Description
download_dir | str | Yes | Directory path where downloaded files will be stored
verbose | bool | No | If True, logs detailed download information (default: False)

Outputs

Name | Type | Description
return value | str or None | Path to the downloaded file on success, or None on failure (via the inherited download() method)

Key Methods

_get_output_filename

Returns the URL directly as the output filename, since ArXiv URLs are already tar file names (e.g., arXiv_src_2301_001.tar).

_download_to_path

Constructs the full S3 path by joining s3://arxiv/src with the URL, then invokes s5cmd as a subprocess to copy the file:

cmd = ["s5cmd", "--request-payer=requester", "cp", s3path, path]

Returns a tuple of (success: bool, error_message: str | None). Subprocess output is suppressed unless verbose mode is enabled.
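
A rough sketch of how such a subprocess call could produce the (success, error_message) tuple is shown below. The function names build_s5cmd_command and run_copy are hypothetical, and the way output is suppressed or captured here is an assumption rather than the actual implementation; only the command list itself comes from the source.

```python
from __future__ import annotations

import subprocess


def build_s5cmd_command(url: str, path: str) -> list[str]:
    """Build the s5cmd copy command for one ArXiv tar file."""
    s3path = f"s3://arxiv/src/{url}"
    return ["s5cmd", "--request-payer=requester", "cp", s3path, path]


def run_copy(cmd: list[str], verbose: bool = False) -> tuple[bool, str | None]:
    """Run `cmd` and report (success, error_message)."""
    # Assumption: in verbose mode the child process writes straight to the
    # console; otherwise stdout is discarded and stderr is captured.
    stdout = None if verbose else subprocess.DEVNULL
    stderr = None if verbose else subprocess.PIPE
    try:
        result = subprocess.run(cmd, stdout=stdout, stderr=stderr, check=False)
    except OSError as exc:
        # Raised, for example, when the executable cannot be found.
        return False, str(exc)
    if result.returncode == 0:
        return True, None
    if result.stderr:
        return False, result.stderr.decode("utf-8", errors="replace")
    return False, f"exit code {result.returncode}"
```

Passing the command as a list avoids shell interpretation of the file name, and returning the error text instead of raising lets the caller decide whether a failed file should abort the run.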

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader

downloader = ArxivDownloader(download_dir="/data/arxiv/raw", verbose=True)

# Download a single tar file
result_path = downloader.download("arXiv_src_2301_001.tar")
if result_path:
    print(f"Downloaded to: {result_path}")

As Part of a Pipeline

from nemo_curator.stages.text.download.base.download import DocumentDownloadStage
from nemo_curator.stages.text.download.arxiv.download import ArxivDownloader

downloader = ArxivDownloader(download_dir="/data/arxiv/raw")
download_stage = DocumentDownloadStage(downloader=downloader)

Dependencies

  • s5cmd: Must be installed on the system. Available from https://github.com/peak/s5cmd.
  • AWS credentials: Required for requester-pays access to the ArXiv S3 bucket.
