Implementation:NVIDIA NeMo Curator ArXiv Extractor

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Extraction, ArXiv, LaTeX
Last Updated	2026-02-14 00:00 GMT

Overview

ArxivExtractor is a concrete implementation of DocumentExtractor that extracts and cleans plain text from ArXiv LaTeX source files by parsing macros, removing comments, stripping preamble/bibliography sections, and inline-expanding macro definitions.

Description

The ArxivExtractor class processes records containing raw LaTeX content from ArXiv papers and produces cleaned text suitable for NLP training data. The implementation is based in large part on the RedPajama ArXiv data preparation code.

The extraction pipeline performs the following steps for each record:

1. Macro discovery: Scans all .tex files in the record to build a dictionary of non-argument macros defined via \newcommand and \def.
2. Content cleaning: For each .tex file:
   - Removes all content before the first section-like header (\chapter, \part, \section, \subsection, \subsubsection, \paragraph, \subparagraph).
   - Strips line comments (lines starting with %) and inline comments.
   - Removes everything after \appendix, \bibliography, or \begin{references}.
   - Inline-expands all discovered non-argument macros.
3. Concatenation: Joins cleaned content from multiple .tex files with newline separators.

For records whose content is empty or whose processing fails, extract returns None and the record is skipped.
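
The cleaning steps above (minus macro expansion) can be sketched roughly as follows. This is a minimal illustration, not the library's code: the helper name clean_tex_sketch and all four regular expressions are assumptions chosen to mirror the description, and the patterns used by ArxivExtractor may differ.

```python
import re

# Assumed patterns for illustration only; the library's regexes may differ.
SECTION_RE = re.compile(
    r"\\(chapter|part|section|subsection|subsubsection|paragraph|subparagraph)\b"
)
END_RE = re.compile(r"\\appendix\b|\\bibliography\b|\\begin\{references\}")
LINE_COMMENT_RE = re.compile(r"^[ \t]*%.*\n?", flags=re.MULTILINE)
INLINE_COMMENT_RE = re.compile(r"(?<!\\)%.*$", flags=re.MULTILINE)


def clean_tex_sketch(tex: str) -> str:
    """Hypothetical helper: keep the body between the first header and the back matter."""
    start = SECTION_RE.search(tex)
    if start is None:
        return ""  # no section-like header found, so nothing is kept
    body = tex[start.start():]

    # Drop whole-line comments first, then trailing comments.
    # The lookbehind leaves escaped percent signs (\%) untouched.
    body = LINE_COMMENT_RE.sub("", body)
    body = INLINE_COMMENT_RE.sub("", body)

    # Truncate at the appendix or bibliography, if present.
    end = END_RE.search(body)
    if end is not None:
        body = body[: end.start()]
    return body
```

Comments are stripped before truncation here so that a commented-out \bibliography line cannot cut the body short; whether the library orders the steps for the same reason is not stated in the source.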

Usage

Use ArxivExtractor as the extractor component in an ArXiv download pipeline. It receives records produced by ArxivIterator (containing lists of raw LaTeX file contents) and outputs records with a single text field containing cleaned plain text.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/arxiv/extract.py
Lines: 1-208

Signature

class ArxivExtractor(DocumentExtractor):
    """Extracts text from Arxiv LaTeX files."""

    def __init__(self):
        ...

    def _build_non_arg_macros_dict(self, file_content: str) -> dict[str, str]:
        ...

    def _clean_tex_file(
        self,
        file_content: str,
        arg_macros: dict[str, str],
        non_arg_macros: dict[str, str],
    ) -> str:
        ...

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        ...

    def input_columns(self) -> list[str]:
        ...

    def output_columns(self) -> list[str]:
        ...

Import

from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor

I/O Contract

Inputs

Name	Type	Required	Description
record	`dict[str, str]`	Yes	Record dictionary with the keys listed below (annotated as `dict[str, str]` in the signature, although `content` holds a list)
record["id"]	`str`	Yes	ArXiv paper identifier
record["source_id"]	`str`	Yes	Source tar file identifier
record["content"]	`list[str]`	Yes	List of raw LaTeX file content strings

Outputs

Name	Type	Description
return value	`dict[str, Any] | None`	Dictionary with key `"text"` containing the cleaned LaTeX text, or `None` if extraction fails or the content is empty

Column Definitions

Direction	Columns
Input	`id`, `source_id`, `content`
Output	`text`

Key Methods

_build_non_arg_macros_dict

Extracts all non-argument LaTeX macro definitions from file content. Handles both the \newcommand{\name}{value} and \def\name{value} patterns. Returns a dictionary mapping macro names to their expanded values; both keys and values are stored as unicode-escaped strings so they can be used in regex substitution.
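
As a rough illustration of this discovery step, the sketch below collects argument-free definitions with two regular expressions. The function name find_non_arg_macros and both patterns are assumptions made for this example; it also skips the unicode escaping mentioned above and does not handle values containing nested braces.

```python
import re

# Assumed patterns: they match only definitions with no argument count,
# e.g. \newcommand{\name}{value} but not \newcommand{\name}[2]{...}.
NEWCOMMAND_RE = re.compile(r"\\newcommand\*?\{(\\[a-zA-Z]+)\}\{([^{}]*)\}")
DEF_RE = re.compile(r"\\def(\\[a-zA-Z]+)\{([^{}]*)\}")


def find_non_arg_macros(tex: str) -> dict[str, str]:
    """Hypothetical helper: map each macro name (with backslash) to its value."""
    macros: dict[str, str] = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for name, value in pattern.findall(tex):
            macros[name] = value
    return macros
```

Macros that take arguments are deliberately left out, since expanding them would require substituting the argument placeholders rather than a plain text replacement.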

_clean_tex_file

Performs the core LaTeX cleaning:

- Strips content before the first section-like header
- Removes line comments and inline comments
- Truncates at appendix or bibliography
- Inline-expands all non-argument macros

Returns an empty string if no section-like header is found.
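
The macro-expansion step can be sketched as below. This is one possible way to do it under stated assumptions, not the library's implementation: expand_macros is a hypothetical helper, and the negative lookahead is this example's own guard against expanding a macro that is merely a prefix of a longer command name.

```python
import re


def expand_macros(text: str, macros: dict[str, str]) -> str:
    """Hypothetical helper: replace each argument-free macro with its value."""
    for name, value in macros.items():
        # Match the macro only when it is not followed by another letter,
        # so \model does not fire inside \modelsize.
        pattern = re.compile(re.escape(name) + r"(?![a-zA-Z])")
        # A callable replacement keeps backslashes in `value` literal
        # instead of treating them as regex escape sequences.
        text = pattern.sub(lambda _match, v=value: v, text)
    return text
```

Passing a callable as the replacement is an alternative to escaping the value beforehand; the unicode-escaped strings described for _build_non_arg_macros_dict appear to serve a similar purpose.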

extract

Main entry point. Builds macro dictionaries across all .tex files in the record, cleans each file, and joins them. Returns None for empty or failed records.
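
The control flow of this entry point might look roughly like the sketch below. It assumes a caller-supplied clean function standing in for the per-file cleaning, and it treats an all-empty result as a skipped record; both the name extract_sketch and that last check are assumptions for illustration rather than confirmed library behavior.

```python
from typing import Any, Callable, Optional


def extract_sketch(
    record: dict[str, Any],
    clean: Callable[[str], str],
) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in for the extract flow: clean each file, then join."""
    files = record.get("content") or []
    if not files:
        return None  # empty content: skip the record

    try:
        cleaned = [clean(file_content) for file_content in files]
    except Exception:
        return None  # any processing failure: skip the record

    text = "\n".join(cleaned)
    if not text.strip():
        return None  # assumption: nothing usable survived cleaning
    return {"text": text}
```

Callers can then filter out None results before writing the remaining records downstream.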

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor

extractor = ArxivExtractor()

record = {
    "id": "2301.00001",
    "source_id": "arXiv_src_2301_001.tar",
    "content": [
        r"\newcommand{\model}{GPT}\section{Introduction}Our \model{} is...",
    ],
}

result = extractor.extract(record)
# result: {"text": "\\section{Introduction}Our GPT is..."}
