Implementation:NVIDIA NeMo Curator Modify

From Leeroopedia
Knowledge Sources
Domains Data_Curation, NLP, Text_Processing
Last Updated 2026-02-14 17:00 GMT

Overview

Concrete tool, provided by NeMo Curator, for applying pluggable text modification functions to document batches.

Description

The Modify stage is a processing stage that applies a DocumentModifier to the text field of each document in a batch. Available modifiers include UnicodeReformatter (mojibake repair via ftfy), NewlineNormalizer (whitespace cleanup), and C4Modifier (C4-style cleaning rules); several can be chained by adding one Modify stage per modifier. The stage operates on the batch's DataFrame in place, modifying the specified text column.
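To make the per-document behavior concrete, the following is a minimal sketch of the idea only, not the NeMo Curator implementation. It assumes a batch can be modeled as a list of dicts instead of a DataFrame, and both `UppercaseModifier` and `SketchModify` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


class UppercaseModifier:
    """Hypothetical modifier exposing the modify_document(text) -> text contract."""

    def modify_document(self, text: str) -> str:
        return text.upper()


@dataclass
class SketchModify:
    """Simplified stand-in for the Modify stage (not the NeMo Curator class)."""

    modifier_fn: UppercaseModifier
    text_field: str = "text"

    def process(self, batch: list[dict]) -> list[dict]:
        # Rewrite the text field of every document in place;
        # all other fields are left untouched.
        for doc in batch:
            doc[self.text_field] = self.modifier_fn.modify_document(doc[self.text_field])
        return batch


batch = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
result = SketchModify(UppercaseModifier()).process(batch)
```

In this sketch `result` is the same object as `batch`, mirroring the in-place behavior described above: only the column named by `text_field` changes.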

Usage

Import this stage when you need to clean and normalize text content in a curation pipeline. Chain multiple Modify stages with different modifiers for comprehensive cleaning.
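Because chained stages run in sequence, each modifier sees the output of the one before it. The sketch below illustrates that with two hypothetical stand-in modifiers (`StripTrailingSpaces` and `CollapseBlankLines`) and a hypothetical `run_chain` helper; none of these are NeMo Curator APIs, and the real modifiers may apply different rules.

```python
import re


class StripTrailingSpaces:
    """Hypothetical modifier: removes spaces and tabs at the end of each line."""

    def modify_document(self, text: str) -> str:
        return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


class CollapseBlankLines:
    """Hypothetical modifier: reduces runs of three or more newlines to two."""

    def modify_document(self, text: str) -> str:
        return re.sub(r"\n{3,}", "\n\n", text)


def run_chain(text: str, modifiers: list) -> str:
    # Each modifier receives the output of the previous one.
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text


raw = "a\n \n \nb"  # two lines that contain only a space
strip_then_collapse = run_chain(raw, [StripTrailingSpaces(), CollapseBlankLines()])
collapse_then_strip = run_chain(raw, [CollapseBlankLines(), StripTrailingSpaces()])
```

For this input the two orderings give different results (`"a\n\nb"` versus `"a\n\n\nb"`), so the order in which stages are added is worth deciding deliberately.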

Code Reference

Source Location

  • Repository: NeMo Curator
  • File: nemo_curator/stages/text/modules/modifier.py
  • Lines: L30-222

Signature

class Modify(ProcessingStage):
    def __init__(
        self,
        modifier_fn: DocumentModifier,
        text_field: str = "text",
        name: str = "modify",
    ):
        """
        Args:
            modifier_fn: The DocumentModifier to apply (e.g., UnicodeReformatter,
                NewlineNormalizer, C4Modifier).
            text_field: Column name containing text to modify.
            name: Stage name for logging.
        """

Import

from nemo_curator.stages.text.modules.modifier import Modify
from nemo_curator.stages.text.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
from nemo_curator.stages.text.modifiers.c4 import C4Modifier

I/O Contract

Inputs

Name | Type | Required | Description
task | DocumentBatch | Yes | DataFrame with text column from prior stage
modifier_fn | DocumentModifier | Yes | Modifier implementing modify_document(text) -> text
text_field | str | No | Column name containing text (default: "text")

Outputs

Name | Type | Description
task | DocumentBatch | DataFrame with cleaned text column (modified in-place)
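The modifier contract in the table comes down to a single method, `modify_document(text) -> text`. The sketch below expresses that contract as a `typing.Protocol` and checks a custom modifier against it. `ModifierLike` and `HtmlEntityModifier` are hypothetical names used for illustration; the actual DocumentModifier base class in NeMo Curator may be defined differently and may require subclassing.

```python
import html
from typing import Protocol, runtime_checkable


@runtime_checkable
class ModifierLike(Protocol):
    """Stand-in for the modifier contract; not the NeMo Curator base class."""

    def modify_document(self, text: str) -> str: ...


class HtmlEntityModifier:
    """Hypothetical modifier that decodes HTML entities such as &amp;."""

    def modify_document(self, text: str) -> str:
        return html.unescape(text)


modifier = HtmlEntityModifier()
# Structural check: passes when a modify_document method is present.
conforms = isinstance(modifier, ModifierLike)
cleaned = modifier.modify_document("Tom &amp; Jerry")
```

A modifier written this way takes one string and returns one string, which keeps it easy to unit test before it is wrapped in a stage.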

Usage Examples

Unicode Cleaning

from nemo_curator.stages.text.modules.modifier import Modify
from nemo_curator.stages.text.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.pipeline import Pipeline

# Fix encoding errors in text
unicode_stage = Modify(
    modifier_fn=UnicodeReformatter(),
    text_field="text",
)

pipeline = Pipeline()
pipeline.add_stage(unicode_stage)

Full Cleaning Pipeline

from nemo_curator.stages.text.modules.modifier import Modify
from nemo_curator.stages.text.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
from nemo_curator.stages.text.modifiers.c4 import C4Modifier
from nemo_curator.pipeline import Pipeline

pipeline = Pipeline()

# 1. Fix Unicode/encoding errors
pipeline.add_stage(Modify(modifier_fn=UnicodeReformatter()))

# 2. Normalize whitespace
pipeline.add_stage(Modify(modifier_fn=NewlineNormalizer()))

# 3. Apply C4-style cleaning
pipeline.add_stage(Modify(modifier_fn=C4Modifier()))

Related Pages

Implements Principle
