Implementation:NVIDIA NeMo Curator Modify
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Text_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool, provided by NeMo Curator, for applying pluggable text modification functions to document batches.
Description
The Modify stage is a processing stage that applies a DocumentModifier to the text field of each document in a batch. Available modifiers include UnicodeReformatter (mojibake repair via ftfy), NewlineNormalizer (whitespace cleanup), and C4Modifier (C4-style cleaning rules). To combine them, chain one Modify stage per modifier. The stage operates on the batch's DataFrame in place, overwriting the specified text column.
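To make the mojibake repair mentioned above concrete, here is a minimal sketch that uses only the Python standard library. It does not call ftfy or NeMo Curator. `naive_mojibake_fix` is a hypothetical helper that covers only the simplest case (UTF-8 bytes mistakenly decoded as Latin-1), whereas ftfy detects and repairs many more cases heuristically.

```python
def naive_mojibake_fix(text: str) -> str:
    """Hypothetical helper: undo UTF-8 text that was wrongly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them with the correct codec.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not this kind of mojibake, so leave it unchanged.
        return text


original = "café"
# Simulate the encoding error: UTF-8 bytes read as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
repaired = naive_mojibake_fix(garbled)
```

Text that is already correct fails the round trip and is returned unchanged, which is the behavior one would want from any repair step placed at the front of a cleaning pipeline.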
Usage
Import this stage when you need to clean and normalize text content in a curation pipeline. Chain multiple Modify stages with different modifiers for comprehensive cleaning.
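When chaining modifiers, the order of the stages can change the result. The sketch below illustrates this with two hypothetical stand-in modifiers (`StripLines` and `CollapseBlankLines`) written in plain Python. They are not NeMo Curator classes and only assume the `modify_document(text) -> text` shape described on this page.

```python
import re


class StripLines:
    """Hypothetical modifier: trim leading/trailing whitespace on every line."""

    def modify_document(self, text: str) -> str:
        return "\n".join(line.strip() for line in text.split("\n"))


class CollapseBlankLines:
    """Hypothetical modifier: reduce runs of 3+ newlines to exactly 2."""

    def modify_document(self, text: str) -> str:
        return re.sub(r"\n{3,}", "\n\n", text)


def run_chain(text: str, modifiers: list) -> str:
    """Apply each modifier in sequence, feeding the output of one to the next."""
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text


# Lines that contain only spaces hide the blank-line run from the collapser.
raw = "a\n  \n\n  \nb"

strip_then_collapse = run_chain(raw, [StripLines(), CollapseBlankLines()])
collapse_then_strip = run_chain(raw, [CollapseBlankLines(), StripLines()])
```

Here stripping first exposes the run of blank lines so it can be collapsed, while the reverse order leaves four consecutive newlines in place. The same reasoning suggests running encoding repair before rules that inspect the text content.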
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/text/modules/modifier.py
- Lines: L30-222
Signature
class Modify(ProcessingStage):
    def __init__(
        self,
        modifier_fn: DocumentModifier,
        text_field: str = "text",
        name: str = "modify",
    ):
        """
        Args:
            modifier_fn: The DocumentModifier to apply (e.g., UnicodeReformatter,
                NewlineNormalizer, C4Modifier).
            text_field: Column name containing text to modify.
            name: Stage name for logging.
        """
Import
from nemo_curator.stages.text.modules.modifier import Modify
from nemo_curator.stages.text.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
from nemo_curator.stages.text.modifiers.c4 import C4Modifier
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| task | DocumentBatch | Yes | DataFrame with text column from prior stage |
| modifier_fn | DocumentModifier | Yes | Modifier implementing modify_document(text) -> text |
| text_field | str | No | Column name containing text (default: "text") |
| name | str | No | Stage name for logging (default: "modify") |
Outputs
| Name | Type | Description |
|---|---|---|
| task | DocumentBatch | DataFrame with cleaned text column (modified in-place) |
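The contract above can be sketched without NeMo Curator installed. In this illustration a list of dicts stands in for the batch's DataFrame, and both `LowercaseModifier` and `apply_to_batch` are hypothetical names. The sketch only mirrors the documented behavior (call `modify_document` on each document's text and overwrite the text column in place) and is not the library's implementation.

```python
class LowercaseModifier:
    """Hypothetical modifier following the modify_document(text) -> text contract."""

    def modify_document(self, text: str) -> str:
        return text.lower()


def apply_to_batch(rows: list, modifier, text_field: str = "text") -> list:
    """Stand-in for the stage: overwrite text_field of every row in place."""
    for row in rows:
        row[text_field] = modifier.modify_document(row[text_field])
    # Return the same object to mirror the in-place semantics.
    return rows


batch = [
    {"id": 1, "text": "Hello World"},
    {"id": 2, "text": "NeMo Curator"},
]
result = apply_to_batch(batch, LowercaseModifier())
```

Only the text field changes; other fields such as `id` pass through untouched, and the returned batch is the same object that was passed in.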
Usage Examples
Unicode Cleaning
from nemo_curator.stages.text.modules.modifier import Modify
from nemo_curator.stages.text.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.pipeline import Pipeline
# Fix encoding errors in text
unicode_stage = Modify(
    modifier_fn=UnicodeReformatter(),
    text_field="text",
)
pipeline = Pipeline()
pipeline.add_stage(unicode_stage)
Full Cleaning Pipeline
from nemo_curator.stages.text.modules.modifier import Modify
from nemo_curator.stages.text.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.stages.text.modifiers.newline_normalizer import NewlineNormalizer
from nemo_curator.stages.text.modifiers.c4 import C4Modifier
from nemo_curator.pipeline import Pipeline
pipeline = Pipeline()
# 1. Fix Unicode/encoding errors
pipeline.add_stage(Modify(modifier_fn=UnicodeReformatter()))
# 2. Normalize whitespace
pipeline.add_stage(Modify(modifier_fn=NewlineNormalizer()))
# 3. Apply C4-style cleaning
pipeline.add_stage(Modify(modifier_fn=C4Modifier()))