Implementation:NVIDIA NeMo Curator NewlineNormalizer
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Text_Cleaning, Whitespace_Normalization |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
NewlineNormalizer is a document modifier that collapses runs of three or more consecutive newline characters into exactly two newlines, normalizing excessive vertical whitespace in documents.
Description
NewlineNormalizer extends DocumentModifier and uses two precompiled regular expressions to handle both Unix-style and Windows-style line endings:
- Unix newlines: The pattern (\n){3,} matches three or more consecutive \n characters and replaces them with \n\n.
- Windows newlines: The pattern (\r\n){3,} matches three or more consecutive \r\n sequences and replaces them with \r\n\r\n.
Both substitutions are applied sequentially in modify_document(), with Unix newlines processed first, followed by Windows newlines. The regex patterns are compiled at module load time for efficiency.
The effect is that paragraph boundaries (double newlines) are preserved, but excessive blank lines commonly found in web-scraped or poorly formatted documents are collapsed.
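The two-pass substitution described above can be sketched without the library. The two patterns are copied from the module-level constants listed on this page; the function name normalize_newlines is a hypothetical stand-in for modify_document(), not the actual NeMo Curator source.

```python
import re

# Patterns as documented in the "Module-Level Constants" section.
THREE_OR_MORE_NEWLINES_REGEX = re.compile(r"(\n){3,}")
THREE_OR_MORE_WINDOWS_NEWLINES_REGEX = re.compile(r"(\r\n){3,}")


def normalize_newlines(text: str) -> str:
    """Hypothetical stand-in for NewlineNormalizer.modify_document()."""
    # Pass 1: collapse runs of 3+ Unix newlines into one paragraph break.
    text = THREE_OR_MORE_NEWLINES_REGEX.sub("\n\n", text)
    # Pass 2: collapse runs of 3+ Windows newlines, keeping the \r\n style.
    text = THREE_OR_MORE_WINDOWS_NEWLINES_REGEX.sub("\r\n\r\n", text)
    return text


collapsed = normalize_newlines("a\n\n\n\nb")  # four newlines in a row
preserved = normalize_newlines("a\n\nb")      # an ordinary paragraph break
```

Both calls yield the same string here: the run of four newlines is reduced to two, and the existing double newline passes through untouched.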
Usage
Use NewlineNormalizer as part of a text normalization pipeline to clean up excessive blank lines in documents. This is particularly useful for web-scraped content where HTML-to-text conversion may produce many consecutive blank lines.
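As a rough illustration of where this step might sit in a cleaning pipeline, the sketch below chains plain functions rather than NeMo Curator stages. Everything in it (run_pipeline, strip_trailing_spaces, collapse_newlines, and the trailing-whitespace pattern) is hypothetical and covers only the Unix pattern. It shows one assumption worth checking on scraped text: a "blank" line that still contains spaces or tabs breaks the run of consecutive newlines, so it may need stripping before the collapse can take effect.

```python
import re
from typing import Callable, Iterable

# Unix pattern as documented on this page.
THREE_OR_MORE_NEWLINES = re.compile(r"(\n){3,}")
# Hypothetical helper pattern: spaces or tabs immediately before a newline.
TRAILING_SPACES = re.compile(r"[ \t]+(?=\n)")


def strip_trailing_spaces(text: str) -> str:
    # Turn whitespace-only lines into truly empty lines.
    return TRAILING_SPACES.sub("", text)


def collapse_newlines(text: str) -> str:
    # Same substitution as the Unix pass of the normalizer.
    return THREE_OR_MORE_NEWLINES.sub("\n\n", text)


def run_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    # Apply each cleaning step in order.
    for step in steps:
        text = step(text)
    return text


scraped = "Title\n \n\t\n\nBody text.\n"
cleaned = run_pipeline(scraped, [strip_trailing_spaces, collapse_newlines])
unchanged = collapse_newlines(scraped)  # whitespace-only lines block the match
```

In this sketch, collapsing alone leaves the scraped string as it was, while stripping first lets the four resulting newlines collapse to two.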
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modifiers/newline_normalizer.py
- Lines: 1-34
Signature
class NewlineNormalizer(DocumentModifier):
    def __init__(self):
        ...

    def modify_document(self, text: str) -> str:
        ...
Module-Level Constants
THREE_OR_MORE_NEWLINES_REGEX = re.compile(r"(\n){3,}")
THREE_OR_MORE_WINDOWS_NEWLINES_REGEX = re.compile(r"(\r\n){3,}")
Import
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | The document text passed to modify_document(). |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | str | The document text with runs of 3+ consecutive newlines collapsed to exactly 2 newlines. Both Unix (\n) and Windows (\r\n) line endings are handled. |
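The contract can be probed at its edges with a small standalone sketch. It reuses the two documented patterns in a hypothetical normalize_newlines function (not the library class) and checks inputs that, under those patterns, should pass through unchanged: an existing paragraph break, a run that mixes \n with \r\n, and bare \r characters. It also checks that a second pass changes nothing.

```python
import re

# Patterns as documented on this page, under shorter hypothetical names.
UNIX_RUN = re.compile(r"(\n){3,}")
WINDOWS_RUN = re.compile(r"(\r\n){3,}")


def normalize_newlines(text: str) -> str:
    """Hypothetical stand-in for modify_document()."""
    text = UNIX_RUN.sub("\n\n", text)
    return WINDOWS_RUN.sub("\r\n\r\n", text)


cases = {
    "two_unix": "a\n\nb",              # already a normal paragraph break
    "three_windows": "a\r\n\r\n\r\nb",  # smallest Windows run that matches
    "mixed": "a\n\r\n\nb",              # \r interrupts the run of \n
    "bare_cr": "a\r\r\r\rb",            # neither pattern matches a lone \r
}
results = {name: normalize_newlines(text) for name, text in cases.items()}

# Running the function on its own output should be a no-op.
once = normalize_newlines("a\n\n\n\n\nb")
twice = normalize_newlines(once)
```

Only the three_windows case is rewritten; the mixed and bare_cr inputs come back as they went in, which suggests line endings may need to be made uniform beforehand if such runs should also be collapsed.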
Usage Examples
Basic Usage
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
modifier = NewlineNormalizer()
# Text with excessive blank lines
text = "Paragraph one.\n\n\n\n\nParagraph two.\n\n\nParagraph three."
result = modifier.modify_document(text)
# Returns "Paragraph one.\n\nParagraph two.\n\nParagraph three."
Windows Line Endings
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
modifier = NewlineNormalizer()
# Text with excessive Windows-style blank lines
text = "First paragraph.\r\n\r\n\r\n\r\nSecond paragraph."
result = modifier.modify_document(text)
# Returns "First paragraph.\r\n\r\nSecond paragraph."
Related Pages
- NVIDIA_NeMo_Curator_DocumentModifier — Base class that this modifier extends
- NVIDIA_NeMo_Curator_LineRemover — Complementary modifier for removing specific line content
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base