Implementation:NVIDIA NeMo Curator ArXiv Extractor
| Knowledge Sources | |
|---|---|
| Domains | Data Extraction, ArXiv, LaTeX |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
ArxivExtractor is a concrete implementation of DocumentExtractor that extracts and cleans plain text from ArXiv LaTeX source files by parsing macros, removing comments, stripping preamble/bibliography sections, and inline-expanding macro definitions.
Description
The ArxivExtractor class processes records containing raw LaTeX content from ArXiv papers and produces cleaned text suitable for NLP training data. The implementation is largely based on the RedPajama ArXiv data preparation code.
The extraction pipeline performs the following steps for each record:
- Macro discovery: Scans all .tex files in the record to build a dictionary of non-argument macros defined via \newcommand and \def.
- Content cleaning: For each .tex file:
  - Removes all content before the first section-like header (\chapter, \part, \section, \subsection, \subsubsection, \paragraph, \subparagraph).
  - Strips line comments (lines starting with %) and inline comments.
  - Removes everything after \appendix, \bibliography, or \begin{references}.
  - Inline-expands all discovered non-argument macros.
- Concatenation: Joins cleaned content from multiple .tex files with newline separators.
Records with empty content, and records that fail processing, are skipped: extract returns None for them.
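The control flow of the three steps above can be sketched as follows. This is a minimal illustration, not the NeMo Curator source: extract_sketch, discover, and clean are hypothetical names, the two helpers are passed in as callables so the sketch stays self-contained, and treating an empty cleaned result as a skipped record is an assumption.

```python
from typing import Any, Callable, Optional


def extract_sketch(
    record: dict[str, Any],
    discover: Callable[[str], dict[str, str]],
    clean: Callable[[str, dict[str, str]], str],
) -> Optional[dict[str, str]]:
    """Hypothetical outline of the flow: discover macros, clean, join."""
    files = record.get("content") or []
    if not files:
        return None  # empty content: skip the record

    try:
        # Pass 1: collect macros from every file before cleaning any of them,
        # so a macro defined in one file can be expanded in another.
        macros: dict[str, str] = {}
        for tex in files:
            macros.update(discover(tex))

        # Pass 2: clean each file and join the results with newlines.
        text = "\n".join(clean(tex, macros) for tex in files)
    except Exception:
        return None  # processing failed: skip the record

    # Assumption: a record whose cleaned text is empty is also skipped.
    if not text.strip():
        return None
    return {"text": text}
```

Collecting all macros in a first pass matters because papers often keep their definitions in a separate preamble or macros file.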
Usage
Use ArxivExtractor as the extractor component in an ArXiv download pipeline. It receives records produced by ArxivIterator (containing lists of raw LaTeX file contents) and outputs records with a single text field containing cleaned plain text.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/arxiv/extract.py
- Lines: 1-208
Signature
class ArxivExtractor(DocumentExtractor):
    """Extracts text from Arxiv LaTeX files."""

    def __init__(self):
        ...

    def _build_non_arg_macros_dict(self, file_content: str) -> dict[str, str]:
        ...

    def _clean_tex_file(
        self,
        file_content: str,
        arg_macros: dict[str, str],
        non_arg_macros: dict[str, str],
    ) -> str:
        ...

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        ...

    def input_columns(self) -> list[str]:
        ...

    def output_columns(self) -> list[str]:
        ...
Import
from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| record | dict[str, str] | Yes | Dictionary containing record data with the following keys |
| record["id"] | str | Yes | ArXiv paper identifier |
| record["source_id"] | str | Yes | Source tar file identifier |
| record["content"] | list[str] | Yes | List of raw LaTeX file content strings |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | dict[str, Any] or None | Dictionary with key "text" containing cleaned LaTeX text, or None if extraction fails or content is empty |
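The input contract can also be written down as a typed structure with a small check. The sketch below is illustrative only: ArxivRecord and validate_record are hypothetical names that do not come from NeMo Curator, and the check covers just the three keys listed in the Inputs table.

```python
from typing import Any, TypedDict


class ArxivRecord(TypedDict):
    """Hypothetical type mirroring the Inputs table."""

    id: str             # ArXiv paper identifier
    source_id: str      # source tar file identifier
    content: list[str]  # raw LaTeX file contents, one string per file


REQUIRED_KEYS = ("id", "source_id", "content")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record fits the table."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    content = record.get("content")
    if content is not None and not (
        isinstance(content, list) and all(isinstance(c, str) for c in content)
    ):
        problems.append("content must be a list of strings")
    return problems
```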
Column Definitions
| Direction | Columns |
|---|---|
| Input | id, source_id, content |
| Output | text |
Key Methods
_build_non_arg_macros_dict
Extracts all non-argument LaTeX macro definitions from file content. Handles both \newcommand{\name}{value} and \def\name{value} patterns. Returns a dictionary mapping macro names to their replacement values; both are stored as unicode-escaped strings so they can be used in regex substitution.
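A rough sketch of this kind of macro discovery, using only the standard re module, is shown below. The patterns are simplified assumptions rather than the ones in extract.py: each definition is expected to sit on its own line, macro names are limited to letters and digits, and values are stored as plain strings instead of unicode-escaped ones. build_non_arg_macros is a hypothetical name.

```python
import re

# Assumed pattern: \newcommand{\name}{value} or \newcommand*{\name}{value},
# alone on a line. A "[2]" argument count between the braces prevents a match.
NEWCOMMAND_RE = re.compile(
    r"^[ \t]*\\newcommand\*?\{(\\[a-zA-Z0-9]+)\}\{(.*)\}[ \t]*$", re.MULTILINE
)
# Assumed pattern: \def\name{value}, alone on a line. A parameter text such
# as "#1" between the name and the brace prevents a match.
DEF_RE = re.compile(
    r"^[ \t]*\\def\s*(\\[a-zA-Z0-9]+)\s*\{(.*)\}[ \t]*$", re.MULTILINE
)


def build_non_arg_macros(tex: str) -> dict[str, str]:
    """Map macro names (with leading backslash) to their replacement text."""
    macros: dict[str, str] = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for name, value in pattern.findall(tex):
            macros[name] = value
    return macros


sample = (
    "\\newcommand{\\model}{GPT}\n"
    "\\def\\eps{\\varepsilon}\n"
    "\\newcommand{\\pair}[2]{(#1, #2)}\n"  # takes arguments, so it is ignored
)
print(build_non_arg_macros(sample))
```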
_clean_tex_file
Performs the core LaTeX cleaning:
- Strips content before the first section-like header
- Removes line comments and inline comments
- Truncates at appendix or bibliography
- Inline-expands all non-argument macros
Returns an empty string if no section-like header is found.
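The four cleaning steps can be approximated with a handful of regular expressions. The sketch below is an assumption-laden illustration, not the code in extract.py: clean_tex is a hypothetical name, the header and end-marker patterns are simplified, and macro expansion uses a negative lookahead so that \model is not replaced inside a longer name such as \modelname.

```python
import re

# Simplified, assumed patterns for section-like headers and end markers.
SECTION_RE = re.compile(
    r"\\(?:chapter|part|section|subsection|subsubsection|paragraph|subparagraph)"
    r"\*?\s*(?:\[[^\]]*\])?\{"
)
END_RE = re.compile(r"\\appendix\b|\\bibliography\{|\\begin\{references\}")


def clean_tex(tex: str, non_arg_macros: dict[str, str]) -> str:
    # 1. Drop everything before the first section-like header.
    header = SECTION_RE.search(tex)
    if header is None:
        return ""  # no header found: nothing usable in this file
    tex = tex[header.start():]

    # 2. Remove full-line comments, then trailing comments that start at an
    #    unescaped % (so "90\%" is left alone).
    tex = re.sub(r"(?m)^%.*\n?", "", tex)
    tex = re.sub(r"(?m)(?<!\\)%.*$", "", tex)

    # 3. Truncate at the first appendix or bibliography marker.
    end = END_RE.search(tex)
    if end is not None:
        tex = tex[: end.start()]

    # 4. Expand non-argument macros. A function replacement keeps backslashes
    #    in the value from being read as regex escapes.
    for name, value in non_arg_macros.items():
        pattern = re.compile(re.escape(name) + r"(?![a-zA-Z0-9])")
        tex = pattern.sub(lambda _match, v=value: v, tex)
    return tex
```

Using re.escape and a function replacement sidesteps the escaping that the real implementation handles by storing macros as unicode-escaped strings.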
extract
Main entry point. Builds macro dictionaries across all .tex files in the record, cleans each file, and joins them. Returns None for empty or failed records.
Usage Examples
Basic Usage
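from nemo_curator.stages.text.download.arxiv.extract import ArxivExtractor

extractor = ArxivExtractor()
record = {
    "id": "2301.00001",
    "source_id": "arXiv_src_2301_001.tar",
    "content": [
        # One string per .tex file; the macro definition sits on its own line.
        "\\newcommand{\\model}{GPT}\n\\section{Introduction}\nOur \\model{} is...",
    ],
}
result = extractor.extract(record)
# result is None if the record was skipped. Otherwise it is a dict with a
# single "text" key whose value should start at \section{Introduction},
# with \model expanded to GPT.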