Implementation:NVIDIA NeMo Curator ArXiv URLGenerator

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Acquisition, ArXiv, S3
Last Updated	2026-02-14 00:00 GMT

Overview

ArxivUrlGenerator is a concrete implementation of URLGenerator that discovers all available ArXiv source tar files on S3 by listing the s3://arxiv/src/ prefix with s5cmd.

Description

The ArxivUrlGenerator class is a dataclass that extends URLGenerator. It runs the s5cmd command-line tool to list the contents of s3://arxiv/src/, filters the listing for .tar files, parses the filenames out of the listing output, sorts them alphabetically, and returns the sorted list. The command passes the --request-payer=requester flag because the ArXiv S3 bucket is a requester-pays bucket.

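The listing is narrowed to lines matching .tar by the grep stage of the shell command. The remaining output is split on whitespace, and every 4th token starting at index 3 is kept. In s5cmd ls output for objects, each line has four whitespace-separated fields (date, time, size, name), so those tokens are the filename column.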
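The parsing step can be illustrated with a minimal sketch. It assumes the four-column layout described above; the sample listing, its dates and sizes, and the function name parse_tar_names are made up for illustration and are not taken from the NeMo Curator source.

```python
# Hypothetical sample of `s5cmd ls` output; dates and sizes are invented.
SAMPLE_LISTING = """\
2023/01/05 10:12:44         524288000 arXiv_src_0001_002.tar
2023/01/05 10:11:02         498073600 arXiv_src_0001_001.tar
2023/01/06 08:30:15         512000000 arXiv_src_0002_001.tar
"""


def parse_tar_names(listing: str) -> list[str]:
    """Extract and sort the filename column of a four-column listing."""
    tokens = listing.split()   # date, time, size, name, date, time, ...
    names = tokens[3::4]       # every 4th token, starting at index 3
    # Keep only tar entries (in the documented command, grep does this).
    return sorted(name for name in names if ".tar" in name)


if __name__ == "__main__":
    for name in parse_tar_names(SAMPLE_LISTING):
        print(name)
```

The stride-based slice only works when every line has exactly four fields, which is why filtering out non-file lines before splitting matters.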

Usage

Use ArxivUrlGenerator as the URL generation component in an ArXiv download pipeline. It provides the initial list of tar file URLs that are then passed to ArxivDownloader for downloading.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/arxiv/url_generation.py
Lines: 1-40

Signature

@dataclass
class ArxivUrlGenerator(URLGenerator):
    """Generates URLs for Arxiv data."""

    def generate_urls(self) -> list[str]:
        ...

    def _get_arxiv_urls(self) -> list[str]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator

I/O Contract

Inputs

Name	Type	Required	Description
(none)	-	-	No constructor parameters; the S3 bucket path is hardcoded

Outputs

Name	Type	Description
return value	`list[str]`	Sorted list of ArXiv tar file names available on S3 (e.g., `["arXiv_src_0001_001.tar", "arXiv_src_0001_002.tar", ...]`)
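Because the returned values are bare file names, a consumer that needs full S3 URIs has to prepend the bucket path. The following is a small sketch of one way to do that; the helper name to_s3_uri is hypothetical, and whether ArxivDownloader performs this step itself is not stated here.

```python
# Prefix taken from the listing command documented on this page.
ARXIV_SRC_PREFIX = "s3://arxiv/src/"


def to_s3_uri(tar_name: str, prefix: str = ARXIV_SRC_PREFIX) -> str:
    """Join a bare tar file name onto the S3 prefix (hypothetical helper)."""
    return prefix.rstrip("/") + "/" + tar_name.lstrip("/")


if __name__ == "__main__":
    print(to_s3_uri("arXiv_src_0001_001.tar"))
```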

Key Methods

generate_urls

Public interface method (from URLGenerator). Delegates to _get_arxiv_urls().

_get_arxiv_urls

Runs the following shell command:

s5cmd --request-payer=requester ls s3://arxiv/src/ | grep '.tar'

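Parses the output to extract the tar filenames, sorts them, and returns the list. Raises a RuntimeError if the command exits with a non-zero return code. Note that for a shell pipeline the return code is normally that of the last stage (here grep), which also exits non-zero when no line matches.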
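The behavior described for this method can be approximated as follows. This is a sketch, not the NeMo Curator source: the function name list_tar_names, the command parameter, and the error message wording are assumptions made for illustration.

```python
import subprocess

# Command as quoted on this page; running it needs s5cmd and AWS credentials.
ARXIV_LIST_CMD = "s5cmd --request-payer=requester ls s3://arxiv/src/ | grep '.tar'"


def list_tar_names(command: str = ARXIV_LIST_CMD) -> list[str]:
    """Run a listing command through the shell and return sorted file names."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr so a missing tool or credential problem is visible.
        msg = f"Listing failed with code {result.returncode}: {result.stderr.strip()}"
        raise RuntimeError(msg)
    # Four fields per line (date, time, size, name); keep the name column.
    return sorted(result.stdout.split()[3::4])
```

Taking the command as a parameter is a choice made for this sketch so that the parsing and error path can be exercised with a stand-in command on a machine without s5cmd.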

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator

url_gen = ArxivUrlGenerator()
urls = url_gen.generate_urls()

print(f"Found {len(urls)} ArXiv tar files")
for url in urls[:5]:
    print(f"  {url}")

With URL Limit in Pipeline

from nemo_curator.stages.text.download.base.url_generation import URLGenerationStage
from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator

url_stage = URLGenerationStage(
    url_generator=ArxivUrlGenerator(),
    limit=10,  # Only process the first 10 tar files
)

Dependencies

s5cmd: Must be installed on the system. Available from https://github.com/peak/s5cmd.
AWS credentials: Required for requester-pays access to the ArXiv S3 bucket.
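A pipeline can check for the s5cmd binary before generating URLs, so that a missing tool is reported clearly instead of as a failed listing. A minimal sketch using only the Python standard library follows; the helper name s5cmd_available is hypothetical, and it does not verify AWS credentials.

```python
import shutil


def s5cmd_available(tool: str = "s5cmd") -> bool:
    """Return True if the given executable is found on PATH."""
    return shutil.which(tool) is not None


if __name__ == "__main__":
    if not s5cmd_available():
        print("s5cmd not found on PATH; see https://github.com/peak/s5cmd")
```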
