Implementation:NVIDIA NeMo Curator ArXiv Iterator

From Leeroopedia
Knowledge Sources
Domains Data Iteration, ArXiv, Archive Processing
Last Updated 2026-02-14 00:00 GMT

Overview

ArxivIterator is a concrete implementation of DocumentIterator that processes downloaded ArXiv tar archives, extracts individual paper records from nested tar/gzip files, loads their LaTeX content, and yields one structured dictionary per paper.

Description

The ArxivIterator class bridges downloaded ArXiv tar archives and the extraction step. Each top-level tar file from the ArXiv S3 bucket contains many nested archives, one per paper. A nested archive is either a tar file (for multi-file submissions) or a gzip file (for single-file submissions).

The iterator performs the following steps:

  1. Safely extracts the top-level tar archive into a temporary directory (using tar_safe_extract to prevent path traversal attacks).
  2. Enumerates all files in the extracted directory recursively.
  3. For each nested file, attempts to load .tex content by:
    • First trying to open it as a tar file and extracting all .tex members.
    • If that fails, trying to open it as a gzip file.
    • Decoding all content as UTF-8.
  4. Formats the ArXiv ID to match the ArXiv identifier specification (handling both the pre- and post-March 2007 formats).
  5. Yields a dictionary with the paper ID, source tar name, and list of LaTeX file contents.
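
Step 1 relies on safe extraction. The following is a minimal sketch of what a path traversal check can look like, not the actual tar_safe_extract implementation: the function name safe_extract_sketch is hypothetical, and the check only covers member names that resolve outside the destination (it does not inspect symlink targets).

```python
import os
import tarfile


def safe_extract_sketch(tar: tarfile.TarFile, dest: str) -> None:
    """Extract `tar` into `dest`, refusing members that would land outside it.

    Hypothetical helper for illustration; NeMo Curator's tar_safe_extract
    may perform different or additional checks.
    """
    dest_root = os.path.realpath(dest)
    for member in tar.getmembers():
        # Resolve where this member would be written, including any "../".
        target = os.path.realpath(os.path.join(dest_root, member.name))
        # commonpath equals dest_root only when target lies inside dest_root.
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"Blocked path traversal attempt: {member.name}")
    # Every member name was validated, so extract the whole archive.
    tar.extractall(dest_root)
```

A member named ../evil.txt resolves to the parent of the destination directory, so the check raises before anything is written to disk.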

The implementation is based in large part on the Red-Pajama ArXiv data preparation code.

Usage

Use ArxivIterator as the iterator component in an ArXiv download pipeline. It consumes downloaded tar files and produces per-paper records that are then passed to ArxivExtractor for LaTeX cleaning.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/arxiv/iterator.py
  • Lines: 1-152

Signature

class ArxivIterator(DocumentIterator):
    """Processes downloaded Arxiv files and extracts article content."""

    def __init__(self, log_frequency: int = 1000):
        ...

    def _tex_proj_loader(self, file_or_dir_path: str) -> list[str] | None:
        ...

    def _format_arxiv_id(self, arxiv_id: str) -> str:
        ...

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        ...

    def output_columns(self) -> list[str]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.iterator import ArxivIterator

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| log_frequency | int | No | How often to log extraction progress (default: every 1000 papers) |
| file_path | str | Yes | Path to a downloaded ArXiv tar archive (passed to iterate()) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| id | str | Formatted ArXiv paper identifier (e.g., 2301.00001 or hep-ph/0301001) |
| source_id | str | Name of the source tar file |
| content | list[str] | List of raw LaTeX file contents for the paper |

Key Methods

_tex_proj_loader

Loads .tex files from a nested archive. First attempts to open as a tar file, extracting all members ending with .tex. If that fails with a tarfile.ReadError, falls back to gzip. All content is decoded as UTF-8; files with decode errors are skipped. Returns None on any failure.
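
The tar-then-gzip fallback described above can be sketched as follows. This is an illustration under stated assumptions rather than the class's actual code: load_tex_sketch is a hypothetical name, and treating a failed gzip read or decode as a None result is an assumption based on the "returns None on any failure" behavior.

```python
import gzip
import tarfile
from typing import List, Optional


def load_tex_sketch(path: str) -> Optional[List[str]]:
    """Return decoded .tex contents from a tar or gzip file, or None on failure.

    Hypothetical helper; the real _tex_proj_loader may differ in detail.
    """
    try:
        tar = tarfile.open(path)  # mode "r" also handles compressed tars
    except tarfile.ReadError:
        tar = None

    if tar is None:
        # Not a tar archive: assume a single gzip-compressed LaTeX file.
        try:
            with gzip.open(path, "rb") as gz:
                return [gz.read().decode("utf-8")]
        except (OSError, EOFError, UnicodeDecodeError):
            return None

    contents: List[str] = []
    try:
        with tar:
            for member in tar.getmembers():
                if not (member.isfile() and member.name.endswith(".tex")):
                    continue
                handle = tar.extractfile(member)
                if handle is None:
                    continue
                try:
                    contents.append(handle.read().decode("utf-8"))
                except UnicodeDecodeError:
                    continue  # skip files that are not valid UTF-8
    except tarfile.TarError:
        return None  # corrupted archive
    return contents
```

Opening the file as a tar first matters because a multi-file submission is typically a gzip-compressed tar; trying gzip first would succeed and return raw tar bytes instead of LaTeX.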

_format_arxiv_id

Converts raw ArXiv identifiers into specification-compliant format:

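| Input Format | Output Format | Example |
|--------------|---------------|---------|
| <archive><YY><MM><NNN> | <archive>/<YYMMNNN> | hep-ph0301001 becomes hep-ph/0301001 |
| <YY><MM><NNNNN> | <YYMM.NNNNN> | No archive prefix, so only the numeric portion is returned (e.g., 2301.00001) |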

Reference: ArXiv Identifier Specification
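
One way to implement this kind of conversion is a single regular expression that separates an optional archive prefix from the numeric part. The sketch below is an assumption about the approach, not a copy of _format_arxiv_id: the name format_arxiv_id_sketch is hypothetical, and it assumes raw IDs are file stems such as hep-ph0301001 or 2301.00001.

```python
import re


def format_arxiv_id_sketch(raw_id: str) -> str:
    """Insert a slash between an archive prefix and the numeric part, if any.

    Hypothetical helper; assumes the raw ID is a file stem with the
    extension already removed.
    """
    # Group 1: optional archive name (letters and hyphens).
    # Group 2: numeric part, which may contain a dot in the newer scheme.
    match = re.fullmatch(r"([a-zA-Z-]*)([\d.]+)", raw_id)
    if match is None:
        raise ValueError(f"Invalid arxiv id: {raw_id}")
    archive, number = match.groups()
    # Old-style IDs carry an archive prefix; new-style IDs are numeric only.
    return f"{archive}/{number}" if archive else number
```

Under these assumptions, an old-style stem gains a slash after the archive name, while a purely numeric stem passes through unchanged.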

iterate

Main generator method. Extracts the top-level tar, walks all nested files, loads LaTeX content via _tex_proj_loader, formats the ArXiv ID, and yields record dictionaries. Papers with no loadable .tex content are silently skipped.
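
The overall flow of such a generator can be sketched as below. This is a simplified illustration, not the real iterate method: iterate_sketch, load_tex, and format_id are hypothetical names, deriving the raw ID by stripping the file extension is an assumption, and the plain extractall call stands in for the safe extraction the class uses.

```python
import os
import tarfile
import tempfile
from typing import Any, Callable, Dict, Iterator, List, Optional


def iterate_sketch(
    tar_path: str,
    load_tex: Callable[[str], Optional[List[str]]],
    format_id: Callable[[str], str] = lambda raw: raw,
) -> Iterator[Dict[str, Any]]:
    """Yield one record per nested paper archive inside `tar_path`.

    Hypothetical sketch: `load_tex` and `format_id` stand in for
    _tex_proj_loader and _format_arxiv_id.
    """
    source_id = os.path.basename(tar_path)
    with tempfile.TemporaryDirectory() as tmpdir:
        with tarfile.open(tar_path) as tar:
            # Assumes a trusted archive; the real class uses tar_safe_extract.
            tar.extractall(tmpdir)
        for root, _dirs, files in os.walk(tmpdir):
            for name in sorted(files):
                content = load_tex(os.path.join(root, name))
                if not content:
                    continue  # skip papers with no loadable .tex content
                # Assumption: the raw ID is the file name minus its extension.
                raw_id = os.path.splitext(name)[0]
                yield {
                    "id": format_id(raw_id),
                    "source_id": source_id,
                    "content": content,
                }
```

Because the temporary directory is removed when the generator finishes, each record must carry the decoded content itself rather than a path to the extracted file.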

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.iterator import ArxivIterator

iterator = ArxivIterator(log_frequency=500)

for record in iterator.iterate("/data/arxiv/raw/arXiv_src_2301_001.tar"):
    print(f"Paper ID: {record['id']}")
    print(f"Number of .tex files: {len(record['content'])}")

As Part of a Pipeline

from nemo_curator.stages.text.download.base.iterator import DocumentIterateExtractStage
from nemo_curator.stages.text.download.arxiv.iterator import ArxivIterator
from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor

stage = DocumentIterateExtractStage(
    iterator=ArxivIterator(log_frequency=1000),
    extractor=ArxivExtractor(),
)
