Principle:NVIDIA NeMo Curator Text Cleaning and Normalization
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Text_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Technique for transforming raw text into a clean, normalized form by fixing encoding errors, removing boilerplate, and standardizing whitespace for downstream NLP processing.
Description
Text Cleaning and Normalization encompasses a family of text transformation operations that convert noisy, web-scraped text into a consistent, clean format suitable for language model training. This includes fixing Unicode encoding errors (mojibake), normalizing excessive whitespace and newlines, removing boilerplate content, and applying dataset-specific cleaning rules (such as C4-style filtering). These operations are critical because web-crawled text contains a wide variety of encoding issues, formatting artifacts, and non-textual content that can degrade model training quality.
In NeMo Curator, this is implemented through the Modify stage, which applies pluggable DocumentModifier classes such as UnicodeReformatter (using ftfy), NewlineNormalizer, and C4Modifier.
Usage
Use this principle as the second step in a text curation pipeline, immediately after data acquisition and before quality filtering. Apply Unicode fixing first, then whitespace normalization, then domain-specific cleaning rules.
Theoretical Basis
Text cleaning follows a layered normalization approach:
- Encoding Repair: Fix mojibake and other encoding errors using heuristic detection (the ftfy library recognizes and reverses common encoding mistakes, such as UTF-8 text decoded as Latin-1)
- Whitespace Normalization: Collapse excessive newlines (3+ consecutive) to double newlines; normalize tabs and spaces
- Content Cleaning: Apply rule-based transformations specific to the data source (e.g., C4 rules: remove lines with curly braces, remove "javascript" references, filter short sentences)
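To make the first layer concrete, the sketch below reproduces and reverses one specific kind of mojibake (UTF-8 bytes decoded as Latin-1) using only the Python standard library. The helper name `repair_utf8_as_latin1` is hypothetical and is not part of ftfy or NeMo Curator; ftfy covers many more cases and decides per text span whether a repair is plausible, which this naive round trip does not attempt.

```python
def repair_utf8_as_latin1(text: str) -> str:
    """Undo one mistake: UTF-8 bytes that were decoded as Latin-1.

    Hypothetical helper for illustration only. It repairs a string only
    when the whole string round-trips cleanly, and leaves it alone otherwise.
    """
    try:
        # Recover the original bytes, then decode them the right way.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (or already correct): keep as-is.
        return text


original = "caf\u00e9"  # "cafe" with an acute accent on the e
# Simulate the encoding error: UTF-8 bytes read as Latin-1.
broken = original.encode("utf-8").decode("latin-1")
fixed = repair_utf8_as_latin1(broken)
```

Here `broken` contains two characters (U+00C3, U+00A9) in place of the single accented letter, and `fixed` equals `original`. Text that is already correct raises a decode error inside the helper and is returned unchanged, which is the property that makes it safe to run this kind of repair before the whitespace and content layers.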
Pseudo-code:
# Abstract cleaning pipeline
def clean_document(text: str, modifiers: list[Modifier]) -> str:
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text

# Typical modifier chain
modifiers = [
    UnicodeReformatter(),  # Fix encoding errors
    NewlineNormalizer(),   # Normalize whitespace
    C4Modifier(),          # Apply C4 cleaning rules
]
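A runnable variant of the pseudo-code above may help show how the chain behaves on a small input. This is a minimal sketch that assumes only the `modify_document(text) -> text` interface; `CollapseNewlines` and `C4StyleLineFilter` are hypothetical stand-ins written for this example, not the NeMo Curator classes, and the `min_words=3` default is an illustrative threshold rather than a documented setting.

```python
import re
from typing import Protocol


class Modifier(Protocol):
    """Anything exposing modify_document(text) -> text."""

    def modify_document(self, text: str) -> str: ...


class CollapseNewlines:
    """Hypothetical stand-in: 3+ consecutive newlines become 2."""

    def modify_document(self, text: str) -> str:
        return re.sub(r"\n{3,}", "\n\n", text)


class C4StyleLineFilter:
    """Hypothetical stand-in for C4-style line rules (assumed, simplified)."""

    def __init__(self, min_words: int = 3) -> None:
        self.min_words = min_words

    def _keep(self, line: str) -> bool:
        if not line.strip():
            return True  # keep blank lines so paragraph breaks survive
        if "{" in line or "}" in line:
            return False  # likely code, not prose
        if "javascript" in line.lower():
            return False  # "enable JavaScript" style notices
        return len(line.split()) >= self.min_words

    def modify_document(self, text: str) -> str:
        return "\n".join(ln for ln in text.split("\n") if self._keep(ln))


def clean_document(text: str, modifiers: list[Modifier]) -> str:
    # Apply each modifier in order; each sees the previous one's output.
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text


raw = (
    "First real sentence here.\n\n\n\n"
    "function f() { return 1; }\n"
    "Please enable JavaScript to continue.\n"
    "OK\n"
    "Second real sentence here."
)
cleaned = clean_document(raw, [CollapseNewlines(), C4StyleLineFilter()])
```

With this input, `cleaned` keeps the two prose sentences separated by a single blank line; the code line, the JavaScript notice, and the one-word line are dropped. Because every modifier shares the same one-method interface, steps can be reordered or replaced without touching `clean_document`.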