Implementation:NVIDIA NeMo Curator ArXiv Iterator
| Knowledge Sources | |
|---|---|
| Domains | Data Iteration, ArXiv, Archive Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
ArxivIterator is a concrete implementation of DocumentIterator that processes downloaded ArXiv tar archives, extracts individual paper records from nested tar/gzip files, loads their LaTeX content, and yields structured dictionaries per paper.
Description
The ArxivIterator class bridges downloaded ArXiv tar archives and the extraction step. Each top-level tar file from the ArXiv S3 bucket contains many nested archives -- one per paper -- which may be either tar files (for multi-file submissions) or gzip files (for single-file submissions).
The iterator performs the following steps:
- Safely extracts the top-level tar archive into a temporary directory (using tar_safe_extract to prevent path traversal attacks).
- Enumerates all files in the extracted directory recursively.
- For each nested file, attempts to load .tex content by:
  - First trying to open it as a tar file and extracting all .tex members.
  - If that fails, trying to open it as a gzip file.
  - Decoding all content as UTF-8.
- Formats the ArXiv ID into the correct specification format (handling both pre- and post-March 2007 formats).
- Yields a dictionary with the paper ID, source tar name, and list of LaTeX file contents.
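The safe-extraction step can be illustrated with a minimal sketch. The function name safe_extract_sketch is hypothetical and is not the tar_safe_extract helper from NeMo Curator; it only shows one common way to reject archive members that would resolve outside the destination directory.

```python
import os
import tarfile


def safe_extract_sketch(archive_path: str, dest_dir: str) -> list[str]:
    """Extract a tar archive, refusing members that would land outside dest_dir.

    Hypothetical stand-in for illustration; not the NeMo Curator helper.
    """
    dest_root = os.path.realpath(dest_dir)
    extracted = []
    with tarfile.open(archive_path) as tf:
        for member in tf.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Reject anything that resolves outside the destination directory.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
            # Skip links and device nodes; keep only plain files and directories.
            if not (member.isfile() or member.isdir()):
                continue
            tf.extract(member, dest_root)
            extracted.append(member.name)
    return extracted
```

Raising on the first unsafe member is a design choice made for this sketch; an implementation could equally skip such members and continue.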
The implementation is based in large part on the Red-Pajama ArXiv data preparation code.
Usage
Use ArxivIterator as the iterator component in an ArXiv download pipeline. It consumes downloaded tar files and produces per-paper records that are then passed to ArxivExtractor for LaTeX cleaning.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/arxiv/iterator.py
- Lines: 1-152
Signature
class ArxivIterator(DocumentIterator):
    """Processes downloaded Arxiv files and extracts article content."""
    def __init__(self, log_frequency: int = 1000):
        ...
    def _tex_proj_loader(self, file_or_dir_path: str) -> list[str] | None:
        ...
    def _format_arxiv_id(self, arxiv_id: str) -> str:
        ...
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        ...
    def output_columns(self) -> list[str]:
        ...
Import
from nemo_curator.stages.text.download.arxiv.iterator import ArxivIterator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| log_frequency | int | No | How often to log extraction progress (default: every 1000 papers) |
| file_path | str | Yes | Path to a downloaded ArXiv tar archive (passed to iterate()) |
Outputs
| Name | Type | Description |
|---|---|---|
| id | str | Formatted ArXiv paper identifier (e.g., 2301.00001 or hep-ph/0301001) |
| source_id | str | Name of the source tar file |
| content | list[str] | List of raw LaTeX file contents for the paper |
Key Methods
_tex_proj_loader
Loads .tex files from a nested archive. First attempts to open as a tar file, extracting all members ending with .tex. If that fails with a tarfile.ReadError, falls back to gzip. All content is decoded as UTF-8; files with decode errors are skipped. Returns None on any failure.
_format_arxiv_id
Converts raw ArXiv identifiers into specification-compliant format:
| Input Format | Output Format | Example |
|---|---|---|
| <archive><YY><MM><NNN> | <archive>/<YYMMNNN> | hep-ph0301001 becomes hep-ph/0301001 |
| <YY><MM><NNNNN> | <YYMM.NNNNN> | 230100001 stays as numeric portion |
Reference: ArXiv Identifier Specification
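As a rough illustration of this conversion, the sketch below splits a raw identifier into an optional alphabetic archive prefix and a numeric part. The function name, the regular expression, and the assumption that raw names keep the archive hyphen (for example hep-ph0301001) are illustrative choices, not confirmed details of _format_arxiv_id.

```python
import re

# Optional archive prefix (letters and hyphens) followed by digits and dots.
_ID_PATTERN = re.compile(r"^([a-zA-Z-]*)([\d.]+)$")


def format_arxiv_id_sketch(raw_id: str) -> str:
    """Insert a slash between the archive prefix and the number, if a prefix exists."""
    match = _ID_PATTERN.match(raw_id)
    if match is None:
        raise ValueError(f"Unrecognized ArXiv ID: {raw_id!r}")
    archive, number = match.groups()
    # Old-style IDs carry an archive prefix; new-style IDs are purely numeric.
    return f"{archive}/{number}" if archive else number
```

Under these assumptions, an old-style name gains a slash after its archive prefix, while a purely numeric new-style identifier is returned unchanged.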
iterate
Main generator method. Extracts the top-level tar, walks all nested files, loads LaTeX content via _tex_proj_loader, formats the ArXiv ID, and yields record dictionaries. Papers with no loadable .tex content are silently skipped.
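The control flow of the generator can be sketched as follows. Here iterate_sketch, load_tex, and format_id are hypothetical names: the loader and formatter are passed in as plain callables so the example stays self-contained, and the sketch assumes each nested file name is the raw paper ID plus a single extension.

```python
import os
from collections.abc import Callable, Iterator
from typing import Any


def iterate_sketch(
    extracted_dir: str,
    source_name: str,
    load_tex: Callable[[str], "list[str] | None"],
    format_id: Callable[[str], str],
) -> Iterator[dict[str, Any]]:
    """Walk an extracted archive directory and yield one record per paper."""
    for root, _dirs, files in os.walk(extracted_dir):
        for file_name in sorted(files):
            content = load_tex(os.path.join(root, file_name))
            if not content:
                continue  # no loadable .tex content: skip silently
            # Assumption: the file name is the raw ID plus one extension.
            raw_id = os.path.splitext(file_name)[0]
            yield {
                "id": format_id(raw_id),
                "source_id": source_name,
                "content": content,
            }
```

Because the function is a generator, records are produced one at a time and a caller can stop early without loading every paper in the archive.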
Usage Examples
Basic Usage
from nemo_curator.stages.text.download.arxiv.iterator import ArxivIterator
iterator = ArxivIterator(log_frequency=500)
for record in iterator.iterate("/data/arxiv/raw/arXiv_src_2301_001.tar"):
    print(f"Paper ID: {record['id']}")
    print(f"Number of .tex files: {len(record['content'])}")
As Part of a Pipeline
from nemo_curator.stages.text.download.base.iterator import DocumentIterateExtractStage
from nemo_curator.stages.text.download.arxiv.iterator import ArxivIterator
from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor
stage = DocumentIterateExtractStage(
    iterator=ArxivIterator(log_frequency=1000),
    extractor=ArxivExtractor(),
)