Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:NVIDIA NeMo Curator ArXiv URLGenerator

From Leeroopedia
Knowledge Sources
Domains Data Acquisition, ArXiv, S3
Last Updated 2026-02-14 00:00 GMT

Overview

ArxivUrlGenerator is a concrete implementation of URLGenerator that discovers all available ArXiv source tar files on S3 by listing the s3://arxiv/src/ bucket using s5cmd.

Description

The ArxivUrlGenerator class is a dataclass that extends URLGenerator. It runs the s5cmd command-line tool to list the contents of the ArXiv S3 bucket, filters for .tar files, parses the filenames from the listing output, sorts them alphabetically, and returns the sorted list. The command uses the --request-payer=requester flag since the ArXiv S3 bucket is a requester-pays bucket.

The S3 listing output is parsed by splitting on whitespace and taking every 4th element starting from index 3 (which corresponds to the filename column in s5cmd ls output), then filtering for entries containing .tar.

Usage

Use ArxivUrlGenerator as the URL generation component in an ArXiv download pipeline. It provides the initial list of tar file URLs that are then passed to ArxivDownloader for downloading.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/arxiv/url_generation.py
  • Lines: 1-40

Signature

@dataclass
class ArxivUrlGenerator(URLGenerator):
    """Generates URLs for Arxiv data."""

    def generate_urls(self) -> list[str]:
        ...

    def _get_arxiv_urls(self) -> list[str]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator

I/O Contract

Inputs

Name Type Required Description
(none) - - No constructor parameters; the S3 bucket path is hardcoded

Outputs

Name Type Description
return value list[str] Sorted list of ArXiv tar file names available on S3 (e.g., ["arXiv_src_0001_001.tar", "arXiv_src_0001_002.tar", ...])

Key Methods

generate_urls

Public interface method (from URLGenerator). Delegates to _get_arxiv_urls().

_get_arxiv_urls

Runs the following shell command:

s5cmd --request-payer=requester ls s3://arxiv/src/ | grep '.tar'

Parses the output to extract tar filenames, sorts them, and returns the list. Raises a RuntimeError if the command fails (non-zero return code).

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator

url_gen = ArxivUrlGenerator()
urls = url_gen.generate_urls()

print(f"Found {len(urls)} ArXiv tar files")
for url in urls[:5]:
    print(f"  {url}")

With URL Limit in Pipeline

from nemo_curator.stages.text.download.base.url_generation import URLGenerationStage
from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator

url_stage = URLGenerationStage(
    url_generator=ArxivUrlGenerator(),
    limit=10,  # Only process the first 10 tar files
)

Dependencies

  • s5cmd: Must be installed on the system. Available from https://github.com/peak/s5cmd.
  • AWS credentials: Required for requester-pays access to the ArXiv S3 bucket.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment