Implementation:NVIDIA NeMo Curator ArXiv URLGenerator
| Knowledge Sources | |
|---|---|
| Domains | Data Acquisition, ArXiv, S3 |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
ArxivUrlGenerator is a concrete implementation of URLGenerator that discovers all available ArXiv source tar files on S3 by listing the s3://arxiv/src/ bucket using s5cmd.
Description
The ArxivUrlGenerator class is a dataclass that extends URLGenerator. It runs the s5cmd command-line tool to list the contents of the ArXiv S3 bucket, filters for .tar files, parses the filenames from the listing output, sorts them alphabetically, and returns the sorted list. The command uses the --request-payer=requester flag since the ArXiv S3 bucket is a requester-pays bucket.
The S3 listing output is parsed by splitting on whitespace and taking every 4th element starting from index 3 (which corresponds to the filename column in s5cmd ls output), then filtering for entries containing .tar.
Usage
Use ArxivUrlGenerator as the URL generation component in an ArXiv download pipeline. It provides the initial list of tar file URLs that are then passed to ArxivDownloader for downloading.
Code Reference
Source Location
- Repository: NeMo-Curator
- File:
nemo_curator/stages/text/download/arxiv/url_generation.py - Lines: 1-40
Signature
@dataclass
class ArxivUrlGenerator(URLGenerator):
"""Generates URLs for Arxiv data."""
def generate_urls(self) -> list[str]:
...
def _get_arxiv_urls(self) -> list[str]:
...
Import
from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | - | - | No constructor parameters; the S3 bucket path is hardcoded |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | list[str] |
Sorted list of ArXiv tar file names available on S3 (e.g., ["arXiv_src_0001_001.tar", "arXiv_src_0001_002.tar", ...])
|
Key Methods
generate_urls
Public interface method (from URLGenerator). Delegates to _get_arxiv_urls().
_get_arxiv_urls
Runs the following shell command:
s5cmd --request-payer=requester ls s3://arxiv/src/ | grep '.tar'
Parses the output to extract tar filenames, sorts them, and returns the list. Raises a RuntimeError if the command fails (non-zero return code).
Usage Examples
Basic Usage
from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator
url_gen = ArxivUrlGenerator()
urls = url_gen.generate_urls()
print(f"Found {len(urls)} ArXiv tar files")
for url in urls[:5]:
print(f" {url}")
With URL Limit in Pipeline
from nemo_curator.stages.text.download.base.url_generation import URLGenerationStage
from nemo_curator.stages.text.download.arxiv.url_generation import ArxivUrlGenerator
url_stage = URLGenerationStage(
url_generator=ArxivUrlGenerator(),
limit=10, # Only process the first 10 tar files
)
Dependencies
- s5cmd: Must be installed on the system. Available from https://github.com/peak/s5cmd.
- AWS credentials: Required for requester-pays access to the ArXiv S3 bucket.