
Implementation:NVIDIA NeMo Curator LineRemover

From Leeroopedia
Revision as of 13:21, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/NVIDIA_NeMo_Curator_LineRemover.md)
Knowledge Sources
Domains: Data_Curation, Text_Cleaning
Last Updated: 2026-02-14 00:00 GMT

Overview

LineRemover is a document modifier that removes lines from a document when the entire line content exactly matches one of a given set of pattern strings.

Description

LineRemover extends DocumentModifier and provides a simple line-level filtering mechanism. It accepts a list of pattern strings at construction time. When modify_document() is called, the modifier splits the input text by newline characters (\n), filters out any line whose content is an exact match to any pattern in the list, and rejoins the remaining lines with newline characters.
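
The split, filter, and rejoin steps described above can be sketched in plain Python. This is a minimal illustration, not the NeMo Curator source: LineRemoverSketch is a hypothetical stand-in that does not inherit from DocumentModifier, and the _patterns attribute name is an assumption.

```python
class LineRemoverSketch:
    """Hypothetical stand-in for illustration; not the NeMo Curator class."""

    def __init__(self, patterns: list[str]) -> None:
        # Assumed attribute name; the real class may store patterns differently.
        self._patterns = patterns

    def modify_document(self, text: str) -> str:
        # 1. Split the document on newline characters.
        lines = text.split("\n")
        # 2. Keep only lines that are not identical to any pattern.
        kept = [line for line in lines if line not in self._patterns]
        # 3. Rejoin the surviving lines with newline characters.
        return "\n".join(kept)


sketch = LineRemoverSketch(patterns=["Back to top"])
result = sketch.modify_document("Intro\nBack to top\nBody")
print(result)
```

With a long pattern list, storing the patterns in a set would make each membership check constant-time; whether the library does this is not stated here.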

Matching is exact: a line must be identical to a pattern string to be removed. Partial or substring matches do not trigger removal, and neither do lines that differ from a pattern only in case or surrounding whitespace. This makes the modifier suitable for removing known, fixed-content lines such as repeated headers, footers, or boilerplate strings.
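
To make the exact-match rule concrete, the sketch below runs four near-identical lines through a stand-in filter. The helper remove_exact_lines is hypothetical and only mirrors the behavior described on this page (split on "\n", drop identical lines, rejoin); it is not part of NeMo Curator.

```python
def remove_exact_lines(text: str, patterns: list[str]) -> str:
    """Hypothetical helper mirroring the described exact-match behavior."""
    return "\n".join(line for line in text.split("\n") if line not in patterns)


patterns = ["Back to top"]
text = "\n".join(
    [
        "Back to top",    # identical to the pattern: removed
        "back to top",    # differs in case: kept
        "Back to top ^",  # pattern is only a substring: kept
        " Back to top",   # leading space: kept
    ]
)

result = remove_exact_lines(text, patterns)
print(result.split("\n"))
```

Only the first line is dropped. If near-matches also need to be removed, the text would have to be normalized (for example stripped or lowercased) before this modifier runs, or a different modifier used.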

Usage

Use LineRemover when you need to strip specific known lines from documents. This is useful for cleaning web-scraped text that contains predictable boilerplate lines (e.g., "Subscribe to our newsletter", navigation breadcrumbs, or repeated section dividers). It is typically passed into the Modify processing stage.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modifiers/line_remover.py
  • Lines: 1-35

Signature

class LineRemover(DocumentModifier):
    def __init__(self, patterns: list[str]):
        ...

    def modify_document(self, text: str) -> str:
        ...

Import

from nemo_curator.stages.text.modifiers.line_remover import LineRemover

I/O Contract

Inputs

Name | Type | Required | Description
patterns | list[str] | Yes | Constructor parameter. A list of strings; any line in the document that exactly matches one of these strings will be removed.
text | str | Yes | The document text passed to modify_document().

Outputs

Name | Type | Description
return value | str | The document text with matching lines removed. Lines are rejoined with newline characters.

Usage Examples

Basic Usage

from nemo_curator.stages.text.modifiers.line_remover import LineRemover

# Remove specific boilerplate lines
modifier = LineRemover(patterns=["Subscribe to our newsletter", "Back to top"])

text = "Article Title\nSubscribe to our newsletter\nArticle content here.\nBack to top"
result = modifier.modify_document(text)
# Returns "Article Title\nArticle content here."

Removing Empty Lines

from nemo_curator.stages.text.modifiers.line_remover import LineRemover

# Remove blank lines (exact empty string match)
modifier = LineRemover(patterns=[""])

text = "Line 1\n\nLine 2\n\nLine 3"
result = modifier.modify_document(text)
# Returns "Line 1\nLine 2\nLine 3"
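
Two edge cases follow from the split-and-rejoin behavior described above and are worth checking when removing blank lines. The sketch below uses a hypothetical stand-in function, drop_lines, rather than the library class, so it shows what the described algorithm implies, not verified library output.

```python
def drop_lines(text: str, patterns: list[str]) -> str:
    """Hypothetical stand-in: split on newlines, drop exact matches, rejoin."""
    return "\n".join(line for line in text.split("\n") if line not in patterns)


# A whitespace-only line is not the empty string, so patterns=[""] keeps it.
# A trailing newline yields a final empty line after splitting, which is removed.
text = "Line 1\n   \nLine 2\n"
result = drop_lines(text, [""])
print(repr(result))
```

Under these assumptions the document loses its trailing newline but keeps the whitespace-only line; adding patterns such as "   " would only cover lines with that exact whitespace.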
