Implementation:NVIDIA NeMo Curator LineRemover
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Text_Cleaning |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
LineRemover is a document modifier that removes lines from a document when the entire line content exactly matches one of a given set of pattern strings.
Description
LineRemover extends DocumentModifier and provides a simple line-level filtering mechanism. It accepts a list of pattern strings at construction time. When modify_document() is called, the modifier splits the input text by newline characters (\n), filters out any line whose content is an exact match to any pattern in the list, and rejoins the remaining lines with newline characters.
The matching is exact -- a line must be identical to a pattern string to be removed. Partial matches or substring matches do not trigger removal. This makes the modifier suitable for removing known, fixed-content lines such as repeated headers, footers, or boilerplate strings.
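The split, filter, and rejoin steps described above can be sketched as a standalone function. This is a minimal illustration based only on the description on this page, not the library's source: the name `remove_exact_lines` is hypothetical, and converting the patterns to a set is an illustrative choice rather than a claim about how LineRemover stores them.

```python
def remove_exact_lines(text: str, patterns: list[str]) -> str:
    """Hypothetical stand-in for the behavior described for modify_document()."""
    # Assumption: a set is used here only for fast membership checks.
    pattern_set = set(patterns)
    # Split on "\n", keep lines that are not identical to any pattern, rejoin.
    kept = [line for line in text.split("\n") if line not in pattern_set]
    return "\n".join(kept)


doc = "Back to top\nSee Back to top for details\nBody text"
cleaned = remove_exact_lines(doc, ["Back to top"])
# The first line is identical to the pattern and is dropped.
# The second line only contains the pattern as a substring, so it is kept.
print(cleaned)
```

Because the comparison is whole-line equality, the second line survives even though it contains the pattern text.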
Usage
Use LineRemover when you need to strip specific known lines from documents. This is useful for cleaning web-scraped text that contains predictable boilerplate lines (e.g., "Subscribe to our newsletter", navigation breadcrumbs, or repeated section dividers). It is typically passed into the Modify processing stage.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modifiers/line_remover.py
- Lines: 1-35
Signature
class LineRemover(DocumentModifier):
    def __init__(self, patterns: list[str]):
        ...
    def modify_document(self, text: str) -> str:
        ...
Import
from nemo_curator.stages.text.modifiers.line_remover import LineRemover
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| patterns | list[str] | Yes | Constructor parameter. A list of strings; any line in the document that exactly matches one of these strings will be removed. |
| text | str | Yes | The document text passed to modify_document(). |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | str | The document text with matching lines removed. Lines are rejoined with newline characters. |
Usage Examples
Basic Usage
from nemo_curator.stages.text.modifiers.line_remover import LineRemover
# Remove specific boilerplate lines
modifier = LineRemover(patterns=["Subscribe to our newsletter", "Back to top"])
text = "Article Title\nSubscribe to our newsletter\nArticle content here.\nBack to top"
result = modifier.modify_document(text)
# Returns "Article Title\nArticle content here."
Removing Empty Lines
from nemo_curator.stages.text.modifiers.line_remover import LineRemover
# Remove blank lines (exact empty string match)
modifier = LineRemover(patterns=[""])
text = "Line 1\n\nLine 2\n\nLine 3"
result = modifier.modify_document(text)
# Returns "Line 1\nLine 2\nLine 3"
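The exact-match rule has a few consequences for the empty-string pattern that are worth checking before relying on it. The sketch below uses a hypothetical standalone function, `remove_exact_lines`, that mirrors the split, filter, and rejoin behavior described on this page; it is not the library class, and the results follow from Python string semantics under that assumption.

```python
def remove_exact_lines(text: str, patterns: list[str]) -> str:
    """Hypothetical stand-in mirroring the described split/filter/join behavior."""
    return "\n".join(line for line in text.split("\n") if line not in patterns)


# A whitespace-only line is not equal to "", so it is kept.
with_spaces = remove_exact_lines("Line 1\n  \nLine 2", [""])

# A trailing newline yields a final empty line after splitting,
# so the "" pattern also drops the trailing newline.
trailing = remove_exact_lines("Line 1\nLine 2\n", [""])

# With Windows line endings, splitting on "\n" leaves "\r" on each line,
# so a visually blank line is "\r" and does not match "".
crlf = remove_exact_lines("Line 1\r\n\r\nLine 2", [""])

print(repr(with_spaces), repr(trailing), repr(crlf))
```

If whitespace-only lines or CRLF input are expected, normalizing the text before applying the modifier is one option.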
Related Pages
- NVIDIA_NeMo_Curator_DocumentModifier — Base class that this modifier extends
- NVIDIA_NeMo_Curator_BoilerPlateStringModifier — Alternative boilerplate removal using substring matching at the paragraph level
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base