Implementation:NVIDIA NeMo Curator DocumentModifier
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Text_Processing, Abstract_Base_Class |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
DocumentModifier is the abstract base class that defines the interface for all text-based document modifiers in the NeMo Curator text processing pipeline.
Description
DocumentModifier inherits from Python's ABC (Abstract Base Class) and declares a single abstract method, modify_document(), which all concrete modifier subclasses must implement. The class provides:
- modify_document(*args, **kwargs): An abstract method that takes document content and returns the modified result. It supports both single-input usage (e.g., modify_document(text)) and multi-input usage (e.g., modify_document(column_1=..., column_2=...)), where each input field is expanded as a keyword argument.
- name property: Returns the modifier's name, defaulting to the class name (self.__class__.__name__) if not explicitly overridden. This is useful for identification in pipeline logs.
The constructor sets _name to the class name and initializes three internal cache attributes (_sentences, _paragraphs, _ngrams) to None. Subclasses can use these attributes to cache intermediate text decompositions across method calls and avoid redundant computation.
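The cache attributes described above could be used roughly as in the following sketch. To keep it runnable without NeMo Curator installed, the DocumentModifier class here is a stand-in that mirrors the interface described on this page, not the library's source. ShortParagraphDropper, its min_words parameter, and the _get_paragraphs helper are hypothetical names introduced only for illustration.

```python
from abc import ABC, abstractmethod


class DocumentModifier(ABC):
    """Stand-in mirroring the interface described on this page (assumption)."""

    def __init__(self) -> None:
        self._name = self.__class__.__name__
        self._sentences = None
        self._paragraphs = None
        self._ngrams = None

    @abstractmethod
    def modify_document(self, *args: object, **kwargs: object) -> object:
        raise NotImplementedError

    @property
    def name(self) -> str:
        return self._name


class ShortParagraphDropper(DocumentModifier):
    """Hypothetical modifier: drops paragraphs with fewer than min_words words."""

    def __init__(self, min_words: int = 3) -> None:
        super().__init__()
        self._min_words = min_words

    def _get_paragraphs(self, text: str) -> list[str]:
        # Split once and reuse the result for later calls on the same text.
        if self._paragraphs is None:
            self._paragraphs = text.split("\n\n")
        return self._paragraphs

    def modify_document(self, text: str) -> str:
        # Reset the cache so a previous document's paragraphs are never reused.
        self._paragraphs = None
        paragraphs = self._get_paragraphs(text)
        kept = [p for p in paragraphs if len(p.split()) >= self._min_words]
        return "\n\n".join(kept)


modifier = ShortParagraphDropper(min_words=3)
cleaned = modifier.modify_document("Menu\n\nThis paragraph has enough words.\n\nOK")
print(cleaned)  # This paragraph has enough words.
```

Resetting the cache at the start of each call is a design choice in this sketch: it keeps one document's paragraphs from leaking into the next when a single modifier instance processes many documents.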
Usage
DocumentModifier is not instantiated directly. Instead, subclass it and implement modify_document() to create a custom text transformation. All built-in modifiers in NeMo Curator (such as BoilerPlateStringModifier, LineRemover, MarkdownRemover, NewlineNormalizer, QuotationRemover, Slicer, and UrlRemover) inherit from this class. The Modify processing stage depends on this interface to apply transformations uniformly.
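The following sketch illustrates why the base class cannot be instantiated directly: Python's ABC machinery raises TypeError for any class that still has an unimplemented abstract method. The DocumentModifier defined here is a stand-in that mirrors the interface described on this page so the example runs without NeMo Curator installed; WhitespaceStripper is a hypothetical subclass.

```python
from abc import ABC, abstractmethod


class DocumentModifier(ABC):
    """Stand-in mirroring the interface described on this page (assumption)."""

    def __init__(self) -> None:
        self._name = self.__class__.__name__

    @abstractmethod
    def modify_document(self, *args: object, **kwargs: object) -> object:
        raise NotImplementedError

    @property
    def name(self) -> str:
        return self._name


class WhitespaceStripper(DocumentModifier):
    """Hypothetical subclass: removes leading and trailing whitespace."""

    def modify_document(self, text: str) -> str:
        return text.strip()


# Instantiating the abstract base fails because modify_document is abstract.
try:
    DocumentModifier()
    instantiated = True
except TypeError:
    instantiated = False

# A subclass that implements modify_document can be instantiated normally.
stripper = WhitespaceStripper()
print(instantiated)                           # False
print(stripper.modify_document("  hello  "))  # hello
print(stripper.name)                          # WhitespaceStripper
```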
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modifiers/doc_modifier.py
- Lines: 1-45
Signature
class DocumentModifier(ABC):
    def __init__(self) -> None:
        ...

    @abstractmethod
    def modify_document(self, *args: object, **kwargs: object) -> object:
        """Transform the provided value(s) and return the result."""
        raise NotImplementedError

    @property
    def name(self) -> str:
        return self._name
Import
from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| *args | object | Varies | Positional arguments passed to modify_document(). For single-input modifiers, this is typically a single str containing the document text. |
| **kwargs | object | Varies | Keyword arguments passed to modify_document(). Used for multi-input modifiers where each column is passed as a named keyword argument. |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | object | The modified document content. For most text modifiers, this is a str. |
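As an illustration of the contract above, the sketch below shows one way a caller might choose between the single-input and multi-input calling conventions. It is not the Modify stage's actual code: apply_modifier, Lowercaser, TitlePrepender, and the record layout are all hypothetical, and DocumentModifier is a stand-in that declares only the abstract method described on this page.

```python
from abc import ABC, abstractmethod


class DocumentModifier(ABC):
    """Stand-in declaring the abstract method described on this page (assumption)."""

    @abstractmethod
    def modify_document(self, *args: object, **kwargs: object) -> object:
        raise NotImplementedError


class Lowercaser(DocumentModifier):
    """Hypothetical single-input modifier."""

    def modify_document(self, text: str) -> str:
        return text.lower()


class TitlePrepender(DocumentModifier):
    """Hypothetical multi-input modifier that reads two named fields."""

    def modify_document(self, title: str, body: str) -> str:
        return f"{title}\n\n{body}"


def apply_modifier(modifier: DocumentModifier, record: dict, input_fields: list[str]) -> object:
    """Hypothetical helper, not part of NeMo Curator."""
    if len(input_fields) == 1:
        # Single input: pass the field value positionally.
        return modifier.modify_document(record[input_fields[0]])
    # Multiple inputs: expand each field as a keyword argument.
    return modifier.modify_document(**{field: record[field] for field in input_fields})


record = {"title": "Release Notes", "body": "Fixed A Bug."}
single = apply_modifier(Lowercaser(), record, ["body"])
multi = apply_modifier(TitlePrepender(), record, ["title", "body"])
print(single)  # fixed a bug.
print(multi)
```

With keyword expansion, the parameter names of a multi-input modify_document() must match the field names being passed, which is why TitlePrepender names its parameters title and body.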
Internal State
| Attribute | Type | Description |
|---|---|---|
| _name | str | Name of the modifier, defaults to the class name. Subclasses may override. |
| _sentences | None or cached | Optional cache for sentence decomposition of the document. |
| _paragraphs | None or cached | Optional cache for paragraph decomposition of the document. |
| _ngrams | None or cached | Optional cache for n-gram decomposition of the document. |
Usage Examples
Creating a Custom Modifier
from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier


class UpperCaseModifier(DocumentModifier):
    """Converts all text to upper case."""

    def __init__(self):
        super().__init__()

    def modify_document(self, text: str) -> str:
        return text.upper()


modifier = UpperCaseModifier()
result = modifier.modify_document("hello world")
# result is "HELLO WORLD"
print(modifier.name)
# Prints: UpperCaseModifier
Multi-Input Modifier
from nemo_curator.stages.text.modifiers.doc_modifier import DocumentModifier


class ConcatModifier(DocumentModifier):
    """Concatenates two text columns."""

    def __init__(self, separator: str = " "):
        super().__init__()
        self._separator = separator

    def modify_document(self, **kwargs) -> str:
        return self._separator.join(kwargs.values())


modifier = ConcatModifier(separator=" | ")
result = modifier.modify_document(column_1="Hello", column_2="World")
# Returns "Hello | World"
Related Pages
- NVIDIA_NeMo_Curator_BoilerPlateStringModifier — Concrete subclass for boilerplate removal
- NVIDIA_NeMo_Curator_LineRemover — Concrete subclass for exact line removal
- NVIDIA_NeMo_Curator_MarkdownRemover — Concrete subclass for Markdown stripping
- NVIDIA_NeMo_Curator_NewlineNormalizer — Concrete subclass for newline normalization
- NVIDIA_NeMo_Curator_QuotationRemover — Concrete subclass for quotation removal
- NVIDIA_NeMo_Curator_Slicer — Concrete subclass for text slicing
- NVIDIA_NeMo_Curator_UrlRemover — Concrete subclass for URL removal
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base