Implementation:NVIDIA NeMo Curator BoilerPlateStringModifier
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Text_Cleaning, Boilerplate_Removal |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
BoilerPlateStringModifier is a document modifier that detects and removes boilerplate text paragraphs (such as "terms of use", "privacy policy", and "lorem ipsum") from web-crawled documents, adapted from the Google C4 dataset processing methodology.
Description
BoilerPlateStringModifier extends DocumentModifier and implements a paragraph-level boilerplate detection strategy. It iterates over each paragraph in a document and checks whether the lowercased paragraph contains any of the known policy substrings defined in the policy_substrings constant list. If any paragraph contains "lorem ipsum", the entire document is discarded (modify_document() returns an empty string). For paragraphs matching other policy substrings, the behavior depends on the remove_if_at_top_or_bottom flag:
- When remove_if_at_top_or_bottom is False, any match causes the entire document to be discarded.
- When remove_if_at_top_or_bottom is True (the default), the modifier collects boilerplate paragraph indices and uses is_paragraph_indices_in_top_or_bottom_only() to determine whether the boilerplate paragraphs appear exclusively at the top or bottom of the document. If they do, only those paragraphs are removed and the remaining text is rejoined. If boilerplate is scattered throughout the middle, the document is returned unchanged.
The modifier also keeps a cached paragraph list in self._paragraphs. The cache is reset to None after paragraphs are removed, because it no longer matches the modified document.
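The decision flow described above can be sketched in plain Python. This is a minimal standalone illustration, not the library's code: POLICY_SUBSTRINGS is a small stand-in for the real policy_substrings list, top_or_bottom_only is a hypothetical substitute for is_paragraph_indices_in_top_or_bottom_only() (the real helper may apply a stricter rule), and splitting paragraphs on blank lines is an assumption. Edge cases such as every paragraph matching are not modeled after the library.

```python
# Illustrative stand-in only: the real list is the `policy_substrings` constant.
POLICY_SUBSTRINGS = ["terms of use", "privacy policy", "cookie policy"]


def top_or_bottom_only(indices: list[int], num_paragraphs: int) -> bool:
    """Hypothetical check: flagged indices form a run from the top and/or a run to the bottom."""
    remaining = set(indices)
    i = 0
    while i in remaining:  # consume the run that touches the top
        remaining.discard(i)
        i += 1
    j = num_paragraphs - 1
    while j in remaining:  # consume the run that touches the bottom
        remaining.discard(j)
        j -= 1
    return not remaining  # anything left sits in the middle


def sketch_modify(text: str, remove_if_at_top_or_bottom: bool = True) -> str:
    paragraphs = text.split("\n\n")  # assumption: blank lines separate paragraphs
    flagged = []
    for idx, paragraph in enumerate(paragraphs):
        lowered = paragraph.strip().lower()
        if "lorem ipsum" in lowered:
            return ""  # placeholder text discards the whole document
        if any(s in lowered for s in POLICY_SUBSTRINGS):
            if not remove_if_at_top_or_bottom:
                return ""  # strict mode: any match discards the document
            flagged.append(idx)
    if not flagged:
        return text  # no boilerplate found
    if top_or_bottom_only(flagged, len(paragraphs)):
        # Drop only the flagged paragraphs and rejoin the rest
        return "\n\n".join(p for i, p in enumerate(paragraphs) if i not in flagged)
    return text  # boilerplate in the middle: leave the document unchanged
```

Under these assumptions, a policy paragraph sandwiched between two content paragraphs leaves the document unchanged in the default mode, while the same document is discarded when remove_if_at_top_or_bottom is False.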
Usage
Use BoilerPlateStringModifier when cleaning web-crawled text datasets that may contain common navigation, legal, or cookie-consent boilerplate. It is especially useful as part of a C4-style text cleaning pipeline and is typically passed into the Modify processing stage.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modifiers/c4.py
- Lines: 1-87
Signature
class BoilerPlateStringModifier(DocumentModifier):
    def __init__(
        self,
        remove_if_at_top_or_bottom: bool = True,
    ):
        ...

    def modify_document(self, text: str) -> str:
        ...
Import
from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| remove_if_at_top_or_bottom | bool | No (default: True) | Constructor parameter. When True, boilerplate paragraphs found only at the top or bottom of the document are selectively removed. When False, any boilerplate match causes the entire document to be discarded. |
| text | str | Yes | The document text passed to modify_document(). |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | str | The modified document text. Returns the original text if no boilerplate is found, an empty string if the document should be discarded, or the text with boilerplate paragraphs removed if they were only at the top or bottom. |
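Because the return value encodes three distinct outcomes, downstream code may want to tell them apart. The following is a minimal sketch of one way to do that; classify_outcome and summarize are hypothetical helper names, not part of NeMo Curator, and modify stands for any callable with the same str -> str shape as modify_document().

```python
from collections import Counter
from typing import Callable


def classify_outcome(original: str, result: str) -> str:
    """Label the outcome of a single modify_document()-style call."""
    if result == "":
        return "discarded"  # empty string signals the document should be dropped
    if result == original:
        return "unchanged"  # no boilerplate, or boilerplate in the middle
    return "trimmed"  # top/bottom boilerplate paragraphs were removed


def summarize(docs: list[str], modify: Callable[[str], str]) -> Counter:
    """Count outcomes over a batch; `modify` is any str -> str callable."""
    return Counter(classify_outcome(doc, modify(doc)) for doc in docs)
```

With a real modifier instance, this would be called as summarize(docs, modifier.modify_document). Note that an input document that is already empty is also labeled "discarded" under this scheme.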
Usage Examples
Basic Usage
from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier
# Create the modifier with default settings (remove top/bottom boilerplate)
modifier = BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)
text = "Privacy Policy\n\nThis is the main content of the article.\n\nTerms of Use"
result = modifier.modify_document(text)
# Boilerplate paragraphs at top and bottom are removed
Strict Mode (Discard Entire Document)
from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier
# Create the modifier in strict mode (discard document on any match)
modifier = BoilerPlateStringModifier(remove_if_at_top_or_bottom=False)
text = "This is the main content.\n\nPlease accept our cookie policy."
result = modifier.modify_document(text)
# Returns empty string because a policy substring was found
Lorem Ipsum Detection
from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier
modifier = BoilerPlateStringModifier()
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
result = modifier.modify_document(text)
# Returns empty string because "lorem ipsum" placeholder text was detected
Related Pages
- NVIDIA_NeMo_Curator_DocumentModifier — Base class that this modifier extends
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base