Implementation:NVIDIA NeMo Curator ArXiv Extractor

From Leeroopedia
Revision as of 13:19, 16 February 2026 by Admin (auto-imported from implementations/NVIDIA_NeMo_Curator_ArXiv_Extractor.md)
Knowledge Sources
Domains: Data Extraction, ArXiv, LaTeX
Last Updated: 2026-02-14 00:00 GMT

Overview

ArxivExtractor is a concrete implementation of DocumentExtractor that turns ArXiv LaTeX source files into cleaned text. It removes comments, strips the preamble and the appendix/bibliography tail, and inline-expands macros that are defined without arguments.

Description

The ArxivExtractor class processes records containing raw LaTeX content from ArXiv papers and produces cleaned text suitable for NLP training data. The implementation is based in large part on the Red-Pajama ArXiv data preparation code.

The extraction pipeline performs the following steps for each record:

  1. Macro discovery: Scans all .tex files in the record to build a dictionary of non-argument macros defined via \newcommand and \def.
  2. Content cleaning: For each .tex file:
    • Removes all content before the first section-like header (\chapter, \part, \section, \subsection, \subsubsection, \paragraph, \subparagraph).
    • Strips line comments (lines starting with %) and inline comments.
    • Removes everything after \appendix, \bibliography, or \begin{references}.
    • Inline-expands all discovered non-argument macros.
  3. Concatenation: Joins cleaned content from multiple .tex files with newline separators.
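
The comment-stripping and truncation bullets in step 2 can be sketched as below. This is a minimal illustration, not the code in extract.py: the helper names (strip_comments, truncate_tail) and the regular expressions are assumptions chosen for readability, and the real patterns may differ.

```python
import re

# Illustrative patterns (assumptions, not the library's own regexes).
# A whole line that starts with % is a line comment.
LINE_COMMENT = re.compile(r"^%.*\n?", flags=re.MULTILINE)
# An unescaped % and the rest of its line is an inline comment.
# Limitation of this sketch: a literal "\\" directly before % is not handled.
INLINE_COMMENT = re.compile(r"(?<!\\)%.*$", flags=re.MULTILINE)
# Markers after which the rest of the file is dropped.
TAIL_MARKER = re.compile(r"\\appendix|\\bibliography\{|\\begin\{references\}")


def strip_comments(tex: str) -> str:
    """Remove full-line comments first, then trailing inline comments."""
    tex = LINE_COMMENT.sub("", tex)
    return INLINE_COMMENT.sub("", tex)


def truncate_tail(tex: str) -> str:
    """Cut the text at the first appendix or bibliography marker, if any."""
    match = TAIL_MARKER.search(tex)
    return tex if match is None else tex[: match.start()]


sample = (
    "\\section{Intro}\n"
    "% a full-line comment\n"
    "Accuracy is 95\\% here. % trailing note\n"
    "\\bibliography{refs}\n"
    "ignored\n"
)
cleaned = truncate_tail(strip_comments(sample))
print(repr(cleaned))
```

The escaped percent sign in "95\%" survives because the lookbehind rejects a % preceded by a backslash, while the trailing note and everything from \bibliography onward are dropped.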

Records with empty content, and records that fail during processing, are skipped: extract returns None for them.

Usage

Use ArxivExtractor as the extractor component in an ArXiv download pipeline. It receives records produced by ArxivIterator, each holding a list of raw LaTeX file contents, and outputs records with a single text field containing the cleaned LaTeX text.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/arxiv/extract.py
  • Lines: 1-208

Signature

class ArxivExtractor(DocumentExtractor):
    """Extracts text from Arxiv LaTeX files."""

    def __init__(self):
        ...

    def _build_non_arg_macros_dict(self, file_content: str) -> dict[str, str]:
        ...

    def _clean_tex_file(
        self,
        file_content: str,
        arg_macros: dict[str, str],
        non_arg_macros: dict[str, str],
    ) -> str:
        ...

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        ...

    def input_columns(self) -> list[str]:
        ...

    def output_columns(self) -> list[str]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor

I/O Contract

Inputs

Name                 Type            Required  Description
record               dict[str, str]  Yes       Dictionary containing the record data, with the keys listed below (type as annotated in the signature; note that the content value is a list)
record["id"]         str             Yes       ArXiv paper identifier
record["source_id"]  str             Yes       Source tar file identifier
record["content"]    list[str]       Yes       List of raw LaTeX file content strings

Outputs

Name          Type                   Description
return value  dict[str, Any] | None  Dictionary with key "text" containing the cleaned LaTeX text, or None if extraction fails or the content is empty

Column Definitions

Direction  Columns
Input      id, source_id, content
Output     text
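
As a rough illustration of the I/O contract above, a caller might check a record's shape before handing it to the extractor. The check_record helper below is hypothetical and not part of NeMo Curator; it only encodes the column names and types listed in the tables.

```python
from typing import Any

# Input columns as listed in the Column Definitions table.
INPUT_COLUMNS = ["id", "source_id", "content"]


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of contract problems; an empty list means the shape is fine.

    Hypothetical helper for illustration only.
    """
    problems = [f"missing key: {key}" for key in INPUT_COLUMNS if key not in record]
    content = record.get("content")
    if content is not None and not (
        isinstance(content, list) and all(isinstance(item, str) for item in content)
    ):
        problems.append("content must be a list of str")
    return problems


good = {"id": "2301.00001", "source_id": "arXiv_src_2301_001.tar", "content": ["x"]}
bad = {"id": "2301.00001", "content": "a single string, not a list"}
print(check_record(good))
print(check_record(bad))
```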

Key Methods

_build_non_arg_macros_dict

Extracts all non-argument LaTeX macro definitions from file content. Handles both the \newcommand{\name}{value} and the \def\name{value} patterns. Returns a dictionary mapping each macro name to its replacement value; both are stored as unicode-escaped strings so that they can be used in regex substitution.
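
A minimal sketch of this kind of macro discovery follows. The function name find_non_arg_macros and both regular expressions are assumptions written for this example; they are not copied from extract.py. The sketch expects each definition to end its own line, which is how definitions usually appear in a preamble.

```python
import re

# Illustrative patterns (assumptions): a macro name in braces followed
# directly by a braced value, so definitions with an argument count such as
# \newcommand{\pair}[2]{...} do not match.
NEWCOMMAND = re.compile(
    r"\\newcommand\*?\{(\\[a-zA-Z0-9]+)\}\{(.*?)\}$", flags=re.MULTILINE
)
DEF = re.compile(r"\\def\s*(\\[a-zA-Z0-9]+)\s*\{(.*?)\}$", flags=re.MULTILINE)


def find_non_arg_macros(tex: str) -> dict[str, str]:
    """Map unicode-escaped macro names to unicode-escaped values."""
    macros: dict[str, str] = {}
    for pattern in (NEWCOMMAND, DEF):
        for match in pattern.finditer(tex):
            # unicode-escape doubles each backslash, so the stored name can
            # later serve as a regex that matches the literal macro.
            name = match.group(1).encode("unicode-escape").decode("utf-8")
            value = match.group(2).encode("unicode-escape").decode("utf-8")
            macros[name] = value
    return macros


preamble = (
    "\\newcommand{\\model}{GPT}\n"
    "\\def\\data{ArXiv}\n"
    "\\newcommand{\\pair}[2]{(#1, #2)}\n"
)
macros = find_non_arg_macros(preamble)
print(macros)
```

The \pair definition is skipped because it takes arguments, and the stored keys contain a doubled backslash as a result of the unicode-escape step.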

_clean_tex_file

Performs the core LaTeX cleaning:

  • Strips content before the first section-like header
  • Removes line comments and inline comments
  • Truncates at appendix or bibliography
  • Inline-expands all non-argument macros

Returns an empty string if no section-like header is found.
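
The header rule and the macro expansion can be sketched as follows. The names drop_preamble and expand_macros are hypothetical, the patterns are illustrative, and for readability the macro dictionary here uses plain names such as \model rather than the escaped form the real dictionary is described as holding.

```python
import re

# Section-like headers listed in the Description, with an optional star.
SECTION_HEADER = re.compile(
    r"\\(chapter|part|section|subsection|subsubsection|paragraph|subparagraph)\*?\{"
)


def drop_preamble(tex: str) -> str:
    """Keep the text from the first section-like header on, or '' if none."""
    match = SECTION_HEADER.search(tex)
    return "" if match is None else tex[match.start():]


def expand_macros(tex: str, macros: dict[str, str]) -> str:
    """Replace each macro with its value (plain, unescaped keys assumed)."""
    for name, value in macros.items():
        # The lookahead stops \model from matching inside \modelsize.
        pattern = re.escape(name) + r"(?![a-zA-Z0-9])"
        # A function replacement avoids backslash handling in the value.
        tex = re.sub(pattern, lambda _match, v=value: v, tex)
    return tex


doc = "\\documentclass{article}\n\\section{Intro}\nOur \\model{} beats \\modelsize.\n"
body = drop_preamble(doc)
expanded = expand_macros(body, {"\\model": "GPT"})
print(repr(expanded))
```

In this sketch the empty braces after \model are left in place; only the macro name itself is substituted.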

extract

Main entry point. Builds macro dictionaries across all .tex files in the record, cleans each file, and joins them. Returns None for empty or failed records.
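
The two-pass flow described here (collect macros from every file, then clean each file with the combined dictionary) might look roughly like the sketch below. extract_record and the two toy callables are hypothetical stand-ins so the example runs on its own, and treating an all-empty result as a skipped record is an assumption.

```python
from __future__ import annotations

from typing import Any, Callable


def extract_record(
    record: dict[str, Any],
    find_macros: Callable[[str], dict[str, str]],
    clean_file: Callable[[str, dict[str, str]], str],
) -> dict[str, Any] | None:
    """Collect macros across all files, clean each file, join with newlines."""
    files = record.get("content")
    if not files:
        return None  # empty record is skipped
    try:
        macros: dict[str, str] = {}
        for tex in files:  # pass 1: macros may be defined in any file
            macros.update(find_macros(tex))
        cleaned = [clean_file(tex, macros) for tex in files]  # pass 2
    except Exception:
        return None  # a failing record is skipped rather than raised
    text = "\n".join(cleaned)
    # Assumption: an empty cleaned result is treated like empty content.
    return {"text": text} if text else None


def toy_find_macros(tex: str) -> dict[str, str]:
    # Toy rule: a file mentioning the definition defines \tool as "Curator".
    return {"\\tool": "Curator"} if "\\def\\tool" in tex else {}


def toy_clean(tex: str, macros: dict[str, str]) -> str:
    # Toy cleaner: drop definition lines, then substitute macro names.
    lines = [line for line in tex.splitlines() if not line.startswith("\\def")]
    body = "\n".join(lines)
    for name, value in macros.items():
        body = body.replace(name, value)
    return body


record = {
    "id": "demo",
    "source_id": "demo.tar",
    "content": ["\\def\\tool{Curator}\nFile one uses \\tool.", "File two uses \\tool."],
}
result = extract_record(record, toy_find_macros, toy_clean)
print(result)
```

Collecting macros before cleaning matters because a macro defined in one file can be used in another, as the second file in this example shows.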

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor

extractor = ArxivExtractor()

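record = {
    "id": "2301.00001",
    "source_id": "arXiv_src_2301_001.tar",
    "content": [
        # One LaTeX statement per line, as in a real .tex file.
        "\\newcommand{\\model}{GPT}\n"
        "\\section{Introduction}\n"
        "Our \\model{} is...\n",
    ],
}

result = extractor.extract(record)
# result is None if extraction fails; otherwise it is a dict with a single
# "text" key whose value starts at \section{Introduction} and has \model
# expanded to GPT.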
