Principle:NVIDIA NeMo Curator Text Cleaning and Normalization

Knowledge Sources	NeMo Curator, ftfy Documentation, C4 Dataset
Domains	Data_Curation, NLP, Text_Processing
Last Updated	2026-02-14 17:00 GMT

Overview

Technique for transforming raw text into a clean, normalized form by fixing encoding errors, removing boilerplate, and standardizing whitespace for downstream NLP processing.

Description

Text Cleaning and Normalization encompasses a family of text transformation operations that convert noisy, web-scraped text into a consistent, clean format suitable for language model training. This includes fixing Unicode encoding errors (mojibake), normalizing excessive whitespace and newlines, removing boilerplate content, and applying dataset-specific cleaning rules (such as C4-style filtering). These operations are critical because web-crawled text contains a wide variety of encoding issues, formatting artifacts, and non-textual content that can degrade model training quality.
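To make the term mojibake concrete, the following minimal sketch reproduces the most common case using only the Python standard library: UTF-8 bytes that were decoded as Latin-1. The helper names `simulate_mojibake` and `naive_repair` are hypothetical and exist only for this illustration. They are not part of ftfy or NeMo Curator, and a real repair library handles many more cases than this single round trip.

```python
def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def naive_repair(text: str) -> str:
    """Reverse the UTF-8-read-as-Latin-1 mistake (hypothetical helper).

    If the round trip is not possible, the text is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = simulate_mojibake("café")
print(garbled)                # cafÃ©
print(naive_repair(garbled))  # café
```

A blind round trip like this can also damage text that was already correct but happens to look like mojibake, which is one reason a heuristic-based tool is preferable to a fixed re-decoding rule.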

In NeMo Curator, this is implemented through the Modify stage, which applies pluggable DocumentModifier implementations such as UnicodeReformatter (using ftfy), NewlineNormalizer, and C4Modifier.

Usage

Use this principle as the second step in a text curation pipeline, immediately after data acquisition and before quality filtering. Apply Unicode fixing first, then whitespace normalization, then domain-specific cleaning rules.

Theoretical Basis

Text cleaning follows a layered normalization approach:

Encoding Repair: Fix mojibake and other encoding errors. The ftfy library uses heuristics to detect and reverse common encoding mistakes, such as UTF-8 text that was decoded as Latin-1
Whitespace Normalization: Collapse excessive newlines (3+ consecutive) to double newlines; normalize tabs and spaces
Content Cleaning: Apply rule-based transformations specific to the data source (e.g., C4 rules: remove lines with curly braces, remove "javascript" references, filter short sentences)
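The whitespace and content layers above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the NeMo Curator implementation: the function names `normalize_whitespace` and `c4_style_clean` are hypothetical, the three-word minimum is an assumed threshold, and the rules are applied per line exactly as the list above describes them.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and 3+ newlines to two."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


def c4_style_clean(text: str, min_words: int = 3) -> str:
    """Drop lines matching simple C4-like rules (min_words is an assumption)."""
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped:
            kept.append("")  # keep blank lines so paragraph breaks survive
            continue
        if "{" in stripped or "}" in stripped:
            continue  # likely code rather than prose
        if "javascript" in stripped.lower():
            continue  # typical "please enable JavaScript" notices
        if len(stripped.split()) < min_words:
            continue  # too short to be a useful sentence
        kept.append(stripped)
    return "\n".join(kept)


raw = (
    "This is a good sentence.\n"
    "function() { return 1; }\n"
    "Please enable JavaScript to continue.\n"
    "Hi"
)
print(c4_style_clean(raw))
```

Removing lines can leave behind longer runs of blank lines, so a pipeline built this way may want to run the whitespace step once more after content cleaning.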

Pseudo-code:

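# Abstract cleaning pipeline
from typing import Protocol


class Modifier(Protocol):
    # Any object exposing modify_document(text) -> text can join the chain
    def modify_document(self, text: str) -> str: ...


def clean_document(text: str, modifiers: list[Modifier]) -> str:
    # Apply each modifier in order; each one receives the previous output
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text

# Typical modifier chain (class names as used in NeMo Curator; imports omitted)
modifiers = [
    UnicodeReformatter(),  # Fix encoding errors
    NewlineNormalizer(),   # Normalize whitespace
    C4Modifier(),          # Apply C4 cleaning rules
]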
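A runnable variant of the chain may help show why the order of modifiers matters. The sketch below uses two stand-in modifiers, `TrailingSpaceStripper` and `NewlineCollapser`, which are hypothetical classes written for this example and are not NeMo Curator classes. The only thing they share with the pseudo-code is the `modify_document` method.

```python
import re
from abc import ABC, abstractmethod


class Modifier(ABC):
    """Stand-in base class: one text-in, text-out method."""

    @abstractmethod
    def modify_document(self, text: str) -> str: ...


class TrailingSpaceStripper(Modifier):
    """Remove trailing whitespace from every line (hypothetical modifier)."""

    def modify_document(self, text: str) -> str:
        return "\n".join(line.rstrip() for line in text.split("\n"))


class NewlineCollapser(Modifier):
    """Collapse 3+ consecutive newlines to two (hypothetical modifier)."""

    def modify_document(self, text: str) -> str:
        return re.sub(r"\n{3,}", "\n\n", text)


def clean_document(text: str, modifiers: list[Modifier]) -> str:
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text


# Lines that contain only a space hide the run of newlines from the collapser.
raw = "a\n \n \n\nb"
good = clean_document(raw, [TrailingSpaceStripper(), NewlineCollapser()])
bad = clean_document(raw, [NewlineCollapser(), TrailingSpaceStripper()])
print(repr(good))  # 'a\n\nb'
print(repr(bad))   # 'a\n\n\n\nb'
```

In the second ordering the collapser runs before the space-only lines are emptied, so the excess newlines survive. The same reasoning motivates running encoding repair before the later stages.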

Related Pages

Implemented By

Implementation:NVIDIA_NeMo_Curator_Modify
