Implementation:NVIDIA NeMo Curator Base DocumentExtractor

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Extraction, Abstract Base Class, Pipeline Infrastructure
Last Updated	2026-02-14 00:00 GMT

Overview

DocumentExtractor is the abstract base class for all document extractors in NeMo Curator, defining the interface for transforming raw record dictionaries into final structured records.

Description

The DocumentExtractor ABC provides a minimal interface for content extraction and transformation. It separates the concern of transforming raw document content (e.g., HTML to text, LaTeX to plain text) from the concern of iterating over files, enabling different extraction strategies to be swapped independently of the iterator.

The class defines three abstract methods:

extract(record): Takes a raw record dictionary and returns a processed dictionary, or None to skip the record. This is where the actual content transformation logic lives.
input_columns(): Declares the expected input field names in the record dictionary.
output_columns(): Declares the field names produced in the output dictionary.

Extractors are used by DocumentIterateExtractStage, which calls extract() on each record yielded by a DocumentIterator.

Usage

Subclass DocumentExtractor to implement content transformation logic for specific data formats. The extractor is then passed to a DocumentIterateExtractStage or a DocumentDownloadExtractStage composite.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/base/extract.py
Lines: 1-40

Signature

class DocumentExtractor(ABC):
    """Abstract base class for document extractors.

    Takes a record dict and returns processed record dict or None to skip.
    Can transform any fields in the input dict.
    """

    @abstractmethod
    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        """Extract/transform a record dict into final record dict."""
        ...

    @abstractmethod
    def input_columns(self) -> list[str]:
        """Define input columns - produces DocumentBatch with records."""
        ...

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Define output columns - produces DocumentBatch with records."""
        ...

Import

from nemo_curator.stages.text.download.base.extract import DocumentExtractor
# Or via the package shortcut:
from nemo_curator.stages.text.download import DocumentExtractor

I/O Contract

Inputs

Name	Type	Required	Description
record	`dict[str, str]`	Yes	Raw record dictionary with fields as defined by `input_columns()`

Outputs

Name	Type	Description
return value	`dict[str, Any] | None`	Processed record dictionary with fields as defined by `output_columns()`, or `None` to skip the record

Abstract Methods

Method	Return Type	Description
`extract(record)`	`dict[str, Any] | None`	Transform a raw record into a structured record; return `None` to skip
`input_columns()`	`list[str]`	Declare the expected input field names
`output_columns()`	`list[str]`	Declare the produced output field names
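
When writing a new extractor, it can help to confirm that the declared columns match what the records actually contain. The sketch below is a hypothetical test-time helper, not part of NeMo Curator; the source does not state whether the library itself enforces the declarations, so `missing_columns` and the sample records are assumptions made for illustration.

```python
from typing import Any


def missing_columns(record: dict[str, Any], columns: list[str]) -> list[str]:
    """Return the declared column names that are absent from the record."""
    return [name for name in columns if name not in record]


# Hypothetical declarations, standing in for input_columns() / output_columns().
declared_inputs = ["html", "url"]
declared_outputs = ["text", "url"]

raw_record = {"html": "<p>hi</p>", "url": "https://example.com"}
processed = {"text": "hi"}  # "url" was left out of the output by mistake

missing_in = missing_columns(raw_record, declared_inputs)
missing_out = missing_columns(processed, declared_outputs)
```

Here `missing_in` is empty because the raw record carries every declared input, while `missing_out` reports the field that the output dictionary failed to produce.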

Usage Examples

Implementing a Custom Extractor

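import re
from html import unescape
from typing import Any

from nemo_curator.stages.text.download import DocumentExtractor


def strip_html_tags(html: str) -> str:
    """Illustrative helper (not part of NeMo Curator): naive regex-based tag removal.

    A real pipeline would likely use a proper HTML parser instead.
    """
    return unescape(re.sub(r"<[^>]+>", " ", html)).strip()


class HtmlToTextExtractor(DocumentExtractor):
    """Extracts plain text from HTML content."""

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        html = record.get("html", "")
        if not html:
            # Returning None tells the stage to skip this record
            return None
        # Perform HTML-to-text conversion
        text = strip_html_tags(html)
        return {"text": text, "url": record.get("url", "")}

    def input_columns(self) -> list[str]:
        # Fields read from the raw record
        return ["html", "url"]

    def output_columns(self) -> list[str]:
        # Fields present in the returned dictionary
        return ["text", "url"]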

Wiring Into a Pipeline

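from nemo_curator.stages.text.download.base.iterator import DocumentIterateExtractStage

# my_iterator is assumed to be a DocumentIterator instance created elsewhere;
# HtmlToTextExtractor is the class from the previous example.
stage = DocumentIterateExtractStage(
    iterator=my_iterator,
    extractor=HtmlToTextExtractor(),
)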

Known Implementations

ArxivExtractor -- Extracts and cleans text from ArXiv LaTeX source files
WikipediaExtractor -- Extracts text from Wikipedia dumps
