
Implementation:NVIDIA NeMo Curator BoilerPlateStringModifier

From Leeroopedia
Knowledge Sources
Domains: Data_Curation, Text_Cleaning, Boilerplate_Removal
Last Updated: 2026-02-14 00:00 GMT

Overview

BoilerPlateStringModifier is a document modifier that detects boilerplate paragraphs in web-crawled documents, such as those containing "terms of use", "privacy policy", or "lorem ipsum", and either removes those paragraphs or discards the whole document. The approach is adapted from the Google C4 dataset processing methodology.

Description

BoilerPlateStringModifier extends DocumentModifier and implements a paragraph-level boilerplate detection strategy. It iterates over each paragraph in a document and checks whether the lowercased paragraph contains "lorem ipsum" or any of the known policy substrings defined in the policy_substrings constant list. If any paragraph contains "lorem ipsum", the entire document is discarded (modify_document returns an empty string), regardless of the flag setting. For paragraphs matching a policy substring, the behavior depends on the remove_if_at_top_or_bottom flag:

  • When remove_if_at_top_or_bottom is False, any match causes the entire document to be discarded.
  • When remove_if_at_top_or_bottom is True (the default), the modifier collects the indices of the matching paragraphs and calls is_paragraph_indices_in_top_or_bottom_only() to determine whether they appear exclusively at the top or bottom of the document. If they do, only those paragraphs are removed and the remaining paragraphs are rejoined. If any boilerplate paragraph sits in the middle of the document, the document is returned unchanged.
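
The paragraph-level logic described above can be sketched in plain Python. This is a minimal, self-contained approximation and not the NeMo Curator source: POLICY_SUBSTRINGS is a small illustrative subset, strip_boilerplate and only_at_top_or_bottom are hypothetical names, splitting paragraphs on blank lines is an assumption, and the real is_paragraph_indices_in_top_or_bottom_only() may apply different rules.

```python
# Illustrative subset only; the real policy_substrings list is defined in NeMo Curator.
POLICY_SUBSTRINGS = ("terms of use", "privacy policy", "cookie policy")


def only_at_top_or_bottom(indices, num_paragraphs):
    """Hypothetical helper: True if indices form only a leading and/or trailing run."""
    remaining = set(indices)
    top = 0
    while top in remaining:  # consume the contiguous run starting at the first paragraph
        remaining.discard(top)
        top += 1
    bottom = num_paragraphs - 1
    while bottom in remaining:  # consume the contiguous run ending at the last paragraph
        remaining.discard(bottom)
        bottom -= 1
    return not remaining  # anything left over sits in the middle


def strip_boilerplate(text, remove_if_at_top_or_bottom=True):
    """Sketch of the described behavior; returns "" when the document is discarded."""
    paragraphs = text.split("\n\n")  # assumption: paragraphs are separated by blank lines
    flagged = []
    for idx, paragraph in enumerate(paragraphs):
        lowered = paragraph.strip().lower()
        if "lorem ipsum" in lowered:
            return ""  # placeholder text discards the whole document
        if any(s in lowered for s in POLICY_SUBSTRINGS):
            if not remove_if_at_top_or_bottom:
                return ""  # strict mode: any match discards the document
            flagged.append(idx)
    if not flagged:
        return text  # no boilerplate found
    if only_at_top_or_bottom(flagged, len(paragraphs)):
        return "\n\n".join(p for i, p in enumerate(paragraphs) if i not in flagged)
    return text  # boilerplate in the middle: leave the document unchanged
```

Under these assumptions, strip_boilerplate("Privacy Policy\n\nMain content.\n\nTerms of Use") returns "Main content.", while a document whose only match is a middle paragraph comes back unchanged.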

The modifier also keeps a cached paragraph list in self._paragraphs. The cache is reset to None after modification, because the cached paragraphs may no longer match the document's structure.

Usage

Use BoilerPlateStringModifier when cleaning web-crawled text datasets that may contain common navigation, legal, or cookie-consent boilerplate. It is especially useful as part of a C4-style text cleaning pipeline and is typically passed into the Modify processing stage.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modifiers/c4.py
  • Lines: 1-87

Signature

class BoilerPlateStringModifier(DocumentModifier):
    def __init__(
        self,
        remove_if_at_top_or_bottom: bool = True,
    ):
        ...

    def modify_document(self, text: str) -> str:
        ...

Import

from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier

I/O Contract

Inputs

Name | Type | Required | Description
remove_if_at_top_or_bottom | bool | No (default: True) | Constructor parameter. When True, boilerplate paragraphs found only at the top or bottom of the document are selectively removed. When False, any boilerplate match causes the entire document to be discarded.
text | str | Yes | The document text passed to modify_document().

Outputs

Name | Type | Description
return value | str | The modified document text. Returns the original text if no boilerplate is found, an empty string if the document should be discarded, or the text with boilerplate paragraphs removed if they were only at the top or bottom.
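
Because all three outcomes are returned as a plain str, a caller that needs to tell them apart has to compare the result with the input. The helper below is a hypothetical sketch (classify_outcome is not part of NeMo Curator) that accepts any callable with the modify_document(text) -> str shape. Note that "unchanged" covers both a document with no boilerplate and one whose boilerplate sits in the middle.

```python
from typing import Callable


def classify_outcome(original: str, modify: Callable[[str], str]) -> str:
    """Label what a modify_document-style callable did to one document."""
    result = modify(original)
    if result == "":
        return "discarded"  # empty string signals the document should be dropped
    if result == original:
        return "unchanged"  # no boilerplate, or boilerplate not confined to top/bottom
    return "trimmed"  # some paragraphs were removed
```

With the real class, this could be called as classify_outcome(text, modifier.modify_document).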

Usage Examples

Basic Usage

from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier

# Create the modifier with default settings (remove top/bottom boilerplate)
modifier = BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)

text = "Privacy Policy\n\nThis is the main content of the article.\n\nTerms of Use"
result = modifier.modify_document(text)
# Boilerplate paragraphs at top and bottom are removed

Strict Mode (Discard Entire Document)

from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier

# Create the modifier in strict mode (discard document on any match)
modifier = BoilerPlateStringModifier(remove_if_at_top_or_bottom=False)

text = "This is the main content.\n\nPlease accept our cookie policy."
result = modifier.modify_document(text)
# Returns empty string because a policy substring was found

Lorem Ipsum Detection

from nemo_curator.stages.text.modifiers.c4 import BoilerPlateStringModifier

modifier = BoilerPlateStringModifier()

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
result = modifier.modify_document(text)
# Returns empty string because "lorem ipsum" placeholder text was detected
