Implementation:NVIDIA NeMo Curator DocumentModifier

From Leeroopedia
Knowledge Sources
Domains Data_Curation, Text_Processing, Abstract_Base_Class
Last Updated 2026-02-14 00:00 GMT

Overview

DocumentModifier is the abstract base class that defines the interface for all text-based document modifiers in the NeMo Curator text processing pipeline.

Description

DocumentModifier inherits from Python's abc.ABC (abstract base class) and declares a single abstract method, modify_document(), which every concrete modifier subclass must implement. The class provides:

  • modify_document(*args, **kwargs): An abstract method that takes document content and returns the modified result. It supports both single-input usage (e.g., modify_document(text)) and multi-input usage (e.g., modify_document(column_1=..., column_2=...)) where each input field is expanded as a keyword argument.
  • name property: Returns the modifier's name, defaulting to the class name (self.__class__.__name__) if not explicitly overridden. This is useful for identification in pipeline logs.

The constructor initializes three internal cache attributes (_sentences, _paragraphs, _ngrams) to None. Subclasses can use them to cache intermediate text decompositions across method calls and avoid redundant computation. A subclass that defines its own __init__ should call super().__init__() so that these attributes exist.
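The behaviour described above can be sketched with a stand-in class. DocumentModifierSketch below is reconstructed from this page's description and is not the library source; its constructor body in particular is an assumption, since the Signature section elides it. StripModifier is a hypothetical subclass used only for illustration.

```python
from abc import ABC, abstractmethod


class DocumentModifierSketch(ABC):
    """Stand-in that mirrors the description on this page (assumed constructor body)."""

    def __init__(self) -> None:
        super().__init__()
        self._name = self.__class__.__name__  # assumed: name defaults to the class name
        self._sentences = None   # cache slots, empty until a subclass fills them
        self._paragraphs = None
        self._ngrams = None

    @abstractmethod
    def modify_document(self, *args: object, **kwargs: object) -> object:
        raise NotImplementedError

    @property
    def name(self) -> str:
        return self._name


class StripModifier(DocumentModifierSketch):
    """Hypothetical subclass: trims surrounding whitespace."""

    def modify_document(self, text: str) -> str:
        return text.strip()


# The base class cannot be instantiated because modify_document() is abstract.
try:
    DocumentModifierSketch()
    base_instantiable = True
except TypeError:
    base_instantiable = False

modifier = StripModifier()
stripped = modifier.modify_document("  hello  ")
```

Because the subclass implements modify_document(), it can be instantiated, and it picks up the name property and the empty cache attributes from the base constructor.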

Usage

DocumentModifier cannot be instantiated directly because modify_document() is abstract. Instead, subclass it and implement modify_document() to create a custom text transformation. All built-in modifiers in NeMo Curator (such as BoilerPlateStringModifier, LineRemover, MarkdownRemover, NewlineNormalizer, QuotationRemover, Slicer, and UrlRemover) inherit from this class. The Modify processing stage depends on this interface to apply transformations uniformly.
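To illustrate what applying transformations uniformly can look like, the sketch below runs a list of modifiers in order through the shared modify_document() interface. The helper apply_modifiers and both subclasses are hypothetical and are not part of NeMo Curator; the actual Modify stage may work differently. The fallback class is a stand-in reconstructed from this page so that the example runs without the library installed.

```python
try:
    from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier
except ImportError:
    # Stand-in (an assumption based on this page) used when NeMo Curator is absent.
    from abc import ABC, abstractmethod

    class DocumentModifier(ABC):
        def __init__(self) -> None:
            self._name = self.__class__.__name__
            self._sentences = self._paragraphs = self._ngrams = None

        @abstractmethod
        def modify_document(self, *args, **kwargs):
            raise NotImplementedError

        @property
        def name(self) -> str:
            return self._name


class LowerCaseModifier(DocumentModifier):
    """Hypothetical modifier: lower-cases the text."""

    def modify_document(self, text: str) -> str:
        return text.lower()


class WhitespaceCollapser(DocumentModifier):
    """Hypothetical modifier: collapses runs of whitespace into single spaces."""

    def modify_document(self, text: str) -> str:
        return " ".join(text.split())


def apply_modifiers(text: str, modifiers: list) -> str:
    """Hypothetical helper: feed each modifier the previous modifier's output."""
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text


pipeline = [LowerCaseModifier(), WhitespaceCollapser()]
cleaned = apply_modifiers("  Hello   WORLD \n", pipeline)
applied_names = [m.name for m in pipeline]
```

The helper never inspects the concrete type of a modifier. It relies only on modify_document() and name, which is the point of having a common base class.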

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modifiers/doc_modifier.py
  • Lines: 1-45

Signature

from abc import ABC, abstractmethod


class DocumentModifier(ABC):
    def __init__(self) -> None:
        ...

    @abstractmethod
    def modify_document(self, *args: object, **kwargs: object) -> object:
        """Transform the provided value(s) and return the result."""
        raise NotImplementedError

    @property
    def name(self) -> str:
        return self._name

Import

from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier

I/O Contract

Inputs

Name | Type | Required | Description
*args | object | Varies | Positional arguments passed to modify_document(). For single-input modifiers, this is typically a single str containing the document text.
**kwargs | object | Varies | Keyword arguments passed to modify_document(). Used for multi-input modifiers where each column is passed as a named keyword argument.

Outputs

Name | Type | Description
return value | object | The modified document content. For most text modifiers, this is a str.
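The two calling conventions in the I/O contract can be sketched as follows. The helper call_modifier, the record layout (a plain dict), and both subclasses are hypothetical illustrations rather than NeMo Curator APIs. The fallback class is a stand-in reconstructed from this page so that the example runs without the library installed.

```python
from typing import Any, Mapping, Sequence

try:
    from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier
except ImportError:
    # Stand-in (an assumption based on this page) used when NeMo Curator is absent.
    from abc import ABC, abstractmethod

    class DocumentModifier(ABC):
        def __init__(self) -> None:
            self._name = self.__class__.__name__
            self._sentences = self._paragraphs = self._ngrams = None

        @abstractmethod
        def modify_document(self, *args, **kwargs):
            raise NotImplementedError

        @property
        def name(self) -> str:
            return self._name


class TrailingPeriodRemover(DocumentModifier):
    """Hypothetical single-input modifier."""

    def modify_document(self, text: str) -> str:
        return text.rstrip(".")


class TitlePrefixer(DocumentModifier):
    """Hypothetical multi-input modifier: parameter names match the input fields."""

    def modify_document(self, title: str, body: str) -> str:
        return f"{title}\n\n{body}"


def call_modifier(
    modifier: DocumentModifier,
    record: Mapping[str, Any],
    input_fields: Sequence[str],
) -> Any:
    """Hypothetical dispatcher: positional for one field, keywords for several."""
    if len(input_fields) == 1:
        return modifier.modify_document(record[input_fields[0]])
    return modifier.modify_document(**{name: record[name] for name in input_fields})


record = {"title": "Intro", "body": "Some text.", "id": 7}
single_result = call_modifier(TrailingPeriodRemover(), record, ["body"])
multi_result = call_modifier(TitlePrefixer(), record, ["title", "body"])
```

In the multi-input case the field names become keyword argument names, so a modifier written this way has to declare parameters with matching names or accept **kwargs.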

Internal State

Attribute | Type | Description
_name | str | Name of the modifier, defaults to the class name. Subclasses may override.
_sentences | None or cached | Optional cache for sentence decomposition of the document.
_paragraphs | None or cached | Optional cache for paragraph decomposition of the document.
_ngrams | None or cached | Optional cache for n-gram decomposition of the document.
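One way a subclass might use the cache attributes is sketched below. The class SentenceTrimmer, its regex-based sentence splitter, the split_calls counter, and the policy of resetting the cache for each document are all assumptions made for illustration; this page does not specify how the built-in modifiers use these attributes. The fallback class is a stand-in reconstructed from this page so that the example runs without the library installed.

```python
import re

try:
    from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier
except ImportError:
    # Stand-in (an assumption based on this page) used when NeMo Curator is absent.
    from abc import ABC, abstractmethod

    class DocumentModifier(ABC):
        def __init__(self) -> None:
            self._name = self.__class__.__name__
            self._sentences = self._paragraphs = self._ngrams = None

        @abstractmethod
        def modify_document(self, *args, **kwargs):
            raise NotImplementedError

        @property
        def name(self) -> str:
            return self._name


class SentenceTrimmer(DocumentModifier):
    """Hypothetical modifier: keeps only the first `max_sentences` sentences."""

    def __init__(self, max_sentences: int = 2) -> None:
        super().__init__()  # creates the cache attributes
        self._max_sentences = max_sentences
        self.split_calls = 0  # instrumentation for this example only

    def _get_sentences(self, text: str) -> list:
        # Split once per document; later calls reuse the cached list.
        if self._sentences is None:
            self.split_calls += 1
            self._sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return self._sentences

    def modify_document(self, text: str) -> str:
        self._sentences = None  # reset so a previous document's cache is not reused
        if len(self._get_sentences(text)) <= self._max_sentences:
            return text
        return " ".join(self._get_sentences(text)[: self._max_sentences])


trimmer = SentenceTrimmer(max_sentences=2)
trimmed = trimmer.modify_document("One. Two! Three? Four.")
```

Within a single modify_document() call the sentences are requested twice but computed once. Resetting the cache at the start of each call keeps results from one document from leaking into the next.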

Usage Examples

Creating a Custom Modifier

from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier

class UpperCaseModifier(DocumentModifier):
    """Converts all text to upper case."""

    def __init__(self):
        super().__init__()

    def modify_document(self, text: str) -> str:
        return text.upper()

modifier = UpperCaseModifier()
result = modifier.modify_document("hello world")
print(result)         # prints: HELLO WORLD
print(modifier.name)  # prints: UpperCaseModifier

Multi-Input Modifier

from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier

class ConcatModifier(DocumentModifier):
    """Concatenates two text columns."""

    def __init__(self, separator: str = " "):
        super().__init__()
        self._separator = separator

    def modify_document(self, **kwargs: str) -> str:
        # Values are joined in the order the keyword arguments were passed.
        return self._separator.join(kwargs.values())

modifier = ConcatModifier(separator=" | ")
result = modifier.modify_document(column_1="Hello", column_2="World")
# result == "Hello | World"
