Implementation:NVIDIA NeMo Curator Base DocumentExtractor
| Knowledge Sources | |
|---|---|
| Domains | Data Extraction, Abstract Base Class, Pipeline Infrastructure |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
DocumentExtractor is the abstract base class for all document extractors in NeMo Curator, defining the interface for transforming raw record dictionaries into final structured records.
Description
The DocumentExtractor ABC provides a minimal interface for content extraction and transformation. It separates the concern of transforming raw document content (e.g., HTML to text, LaTeX to plain text) from the concern of iterating over files, enabling different extraction strategies to be swapped independently of the iterator.
The class defines three abstract methods:
- extract(record): Takes a raw record dictionary and returns a processed dictionary, or None to skip the record. This is where the actual content transformation logic lives.
- input_columns(): Declares the expected input field names in the record dictionary.
- output_columns(): Declares the field names produced in the output dictionary.
Extractors are used by DocumentIterateExtractStage, which calls extract() on each record yielded by a DocumentIterator.
Usage
Subclass DocumentExtractor to implement content transformation logic for specific data formats. The extractor is then passed to a DocumentIterateExtractStage or a DocumentDownloadExtractStage composite.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/base/extract.py
- Lines: 1-40
Signature
class DocumentExtractor(ABC):
    """Abstract base class for document extractors.

    Takes a record dict and returns processed record dict or None to skip.
    Can transform any fields in the input dict.
    """

    @abstractmethod
    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        """Extract/transform a record dict into final record dict."""
        ...

    @abstractmethod
    def input_columns(self) -> list[str]:
        """Define input columns - produces DocumentBatch with records."""
        ...

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Define output columns - produces DocumentBatch with records."""
        ...
Import
from nemo_curator.stages.text.download.base.extract import DocumentExtractor
# Or via the package shortcut:
from nemo_curator.stages.text.download import DocumentExtractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| record | dict[str, str] | Yes | Raw record dictionary with fields as defined by input_columns() |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | dict[str, Any] or None | Processed record dictionary with fields as defined by output_columns(), or None to skip the record |
Abstract Methods
| Method | Return Type | Description |
|---|---|---|
| extract(record) | dict[str, Any] or None | Transform a raw record into a structured record; return None to skip |
| input_columns() | list[str] | Declare the expected input field names |
| output_columns() | list[str] | Declare the produced output field names |
Usage Examples
Implementing a Custom Extractor
from typing import Any

from nemo_curator.stages.text.download import DocumentExtractor


class HtmlToTextExtractor(DocumentExtractor):
    """Extracts plain text from HTML content."""

    def extract(self, record: dict[str, str]) -> dict[str, Any] | None:
        html = record.get("html", "")
        if not html:
            # Skip records that carry no HTML payload
            return None
        # Perform HTML-to-text conversion.
        # strip_html_tags is a placeholder for a user-supplied helper.
        text = strip_html_tags(html)
        return {"text": text, "url": record.get("url", "")}

    def input_columns(self) -> list[str]:
        return ["html", "url"]

    def output_columns(self) -> list[str]:
        return ["text", "url"]
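The example above calls strip_html_tags without defining it. One possible implementation, using only the standard library's `html.parser`, is sketched below. It is an assumption made for illustration, not a NeMo Curator helper, and a production extractor would likely rely on a dedicated HTML extraction library instead.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join text nodes with a space, then collapse whitespace runs.
        # Trade-off: a word split by inline markup ("wor<b>ld</b>") gains a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html_tags(html: str) -> str:
    """Return the visible text of an HTML string (simplified sketch)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello &amp; <b>world</b></p></body></html>"
)
plain = strip_html_tags(sample)
```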
Wiring Into a Pipeline
from nemo_curator.stages.text.download.base.iterator import DocumentIterateExtractStage

# my_iterator is a placeholder for a DocumentIterator instance defined elsewhere.
stage = DocumentIterateExtractStage(
    iterator=my_iterator,
    extractor=HtmlToTextExtractor(),
)
Known Implementations
- ArxivExtractor -- Extracts and cleans text from ArXiv LaTeX source files
- WikipediaExtractor -- Extracts text from Wikipedia dumps