

Implementation:NVIDIA NeMo Curator Base DocumentExtractor

From Leeroopedia
Knowledge Sources
Domains: Data Extraction, Abstract Base Class, Pipeline Infrastructure
Last Updated: 2026-02-14 00:00 GMT

Overview

DocumentExtractor is the abstract base class for all document extractors in NeMo Curator, defining the interface for transforming raw record dictionaries into final structured records.

Description

The DocumentExtractor ABC provides a minimal interface for content extraction and transformation. It separates the concern of transforming raw document content (e.g., HTML to text, LaTeX to plain text) from the concern of iterating over files, enabling different extraction strategies to be swapped independently of the iterator.

The class defines three abstract methods:

  • extract(record): Takes a raw record dictionary and returns a processed dictionary, or None to skip the record. This is where the actual content transformation logic lives.
  • input_columns(): Declares the expected input field names in the record dictionary.
  • output_columns(): Declares the field names produced in the output dictionary.

Extractors are used by DocumentIterateExtractStage, which calls extract() on each record yielded by a DocumentIterator.

Usage

Subclass DocumentExtractor to implement content transformation logic for specific data formats. The extractor is then passed to a DocumentIterateExtractStage or a DocumentDownloadExtractStage composite.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/base/extract.py
  • Lines: 1-40

Signature

class DocumentExtractor(ABC):
    """Abstract base class for document extractors.

    Takes a record dict and returns processed record dict or None to skip.
    Can transform any fields in the input dict.
    """

    @abstractmethod
    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        """Extract/transform a record dict into final record dict."""
        ...

    @abstractmethod
    def input_columns(self) -> list[str]:
        """Define input columns - produces DocumentBatch with records."""
        ...

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Define output columns - produces DocumentBatch with records."""
        ...

Import

from nemo_curator.stages.text.download.base.extract import DocumentExtractor
# Or via the package shortcut:
from nemo_curator.stages.text.download import DocumentExtractor

I/O Contract

Inputs

Name | Type | Required | Description
record | dict[str, str] | Yes | Raw record dictionary with fields as defined by input_columns()

Outputs

Name | Type | Description
return value | dict[str, Any] or None | Processed record dictionary with fields as defined by output_columns(), or None to skip the record

Abstract Methods

Method | Return Type | Description
extract(record) | dict[str, Any] or None | Transform a raw record into a structured record; return None to skip
input_columns() | list[str] | Declare the expected input field names
output_columns() | list[str] | Declare the produced output field names

Usage Examples

Implementing a Custom Extractor

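import re
from typing import Any

from nemo_curator.stages.text.download import DocumentExtractor


def strip_html_tags(html: str) -> str:
    """Illustrative placeholder (not part of NeMo Curator): crude regex tag removal."""
    return re.sub(r"<[^>]+>", " ", html).strip()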


class HtmlToTextExtractor(DocumentExtractor):
    """Extracts plain text from HTML content."""

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        html = record.get("html", "")
        if not html:
            return None
        # Perform HTML-to-text conversion
        text = strip_html_tags(html)
        return {"text": text, "url": record.get("url", "")}

    def input_columns(self) -> list[str]:
        return ["html", "url"]

    def output_columns(self) -> list[str]:
        return ["text", "url"]

Wiring Into a Pipeline

from nemo_curator.stages.text.download.base.iterator import DocumentIterateExtractStage

# my_iterator is assumed to be a DocumentIterator instance created elsewhere.
stage = DocumentIterateExtractStage(
    iterator=my_iterator,
    extractor=HtmlToTextExtractor(),
)
