Workflow:NVIDIA NeMo Curator Text Curation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, LLM_Training |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end process for curating high-quality text datasets from web-scale sources for large language model training using NeMo Curator's stage-based pipeline architecture.
Description
This workflow outlines the standard procedure for preparing text data for LLM pretraining and fine-tuning. It covers the full lifecycle from acquiring raw text data (Common Crawl, Wikipedia, arXiv, or custom sources) through cleaning, filtering, quality assessment, deduplication, and final export. The pipeline leverages GPU-accelerated processing via RAPIDS cuDF/cuML and distributed execution on Ray clusters. Each stage in the pipeline is a composable ProcessingStage that can be configured independently and executed through the Pipeline API or YAML-driven configuration.
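The idea of composing independent stages can be sketched in plain Python. This is a toy illustration only: `ToyStage` and `ToyPipeline` are hypothetical names invented for this example and are not NeMo Curator's `ProcessingStage` or `Pipeline` classes, whose real signatures are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Doc = Dict[str, object]
Batch = List[Doc]


@dataclass
class ToyStage:
    """Hypothetical stand-in for a pipeline stage: a named batch-to-batch function."""
    name: str
    fn: Callable[[Batch], Batch]

    def process(self, batch: Batch) -> Batch:
        return self.fn(batch)


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: runs stages in the order they were added."""
    stages: List[ToyStage] = field(default_factory=list)

    def add_stage(self, stage: ToyStage) -> "ToyPipeline":
        self.stages.append(stage)
        return self

    def run(self, batch: Batch) -> Batch:
        for stage in self.stages:
            batch = stage.process(batch)
        return batch


# Two independent stages: one rewrites text, one drops empty documents.
strip_ws = ToyStage("strip_whitespace",
                    lambda b: [{**d, "text": d["text"].strip()} for d in b])
drop_empty = ToyStage("drop_empty", lambda b: [d for d in b if d["text"]])

pipeline = ToyPipeline().add_stage(strip_ws).add_stage(drop_empty)
result = pipeline.run([{"id": 1, "text": "  hello  "}, {"id": 2, "text": "   "}])
print(result)
```

Because each stage only agrees on the batch format, stages can be reordered, swapped, or configured without touching the others.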
Usage
Use this workflow when you have raw text data from web crawls, academic archives, or custom sources and need to produce a clean, deduplicated, high-quality text corpus for LLM training. The pipeline is designed for datasets ranging from gigabytes to petabytes and supports both CPU and GPU processing modes.
Execution Steps
Step 1: Data Acquisition
Acquire raw text data from one or more supported sources. NeMo Curator provides built-in download and extraction stages for Common Crawl (WARC files), Wikipedia (dump files), and arXiv (LaTeX source archives). For custom data sources, implement the DocumentDownloader, DocumentIterator, and DocumentExtractor abstract classes to create a custom download pipeline. Each source uses a composite stage that chains URL generation, downloading, iteration, and extraction into a single pipeline unit.
Key considerations:
- Common Crawl requires generating URLs from crawl indices and downloading WARC files
- Wikipedia processing extracts clean text from MediaWiki markup, removing tables, infoboxes, and references
- arXiv extraction converts LaTeX source to plain text
- Custom sources must implement the base abstract classes for URL generation, downloading, and extraction
- Output is written as JSONL or Parquet files with a text column
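The download, iterate, extract chain for a custom source can be sketched as follows. This is a minimal stand-alone sketch, not the NeMo Curator interface: the `Toy*` base classes, their method names, the `toy://` URLs, and the in-memory "web" are all assumptions made for illustration. The real `DocumentDownloader`, `DocumentIterator`, and `DocumentExtractor` signatures should be taken from the library documentation.

```python
import json
import re
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator, Optional

# Hypothetical in-memory "source" so the example runs without network access.
FAKE_WEB = {
    "toy://a": "<p>First page</p>\n\n<p>Second page</p>",
    "toy://b": "<br>\n\n<h1>Third page</h1>",
}


class ToyDownloader(ABC):
    @abstractmethod
    def download(self, url: str) -> str: ...


class ToyIterator(ABC):
    @abstractmethod
    def iterate(self, raw: str, url: str) -> Iterator[dict]: ...


class ToyExtractor(ABC):
    @abstractmethod
    def extract(self, record: dict) -> Optional[dict]: ...


class InMemoryDownloader(ToyDownloader):
    def download(self, url: str) -> str:
        return FAKE_WEB[url]


class BlankLineIterator(ToyIterator):
    def iterate(self, raw: str, url: str) -> Iterator[dict]:
        # Treat blank-line-separated chunks as individual raw records.
        for index, chunk in enumerate(raw.split("\n\n")):
            yield {"url": url, "index": index, "raw": chunk}


class TagStrippingExtractor(ToyExtractor):
    def extract(self, record: dict) -> Optional[dict]:
        text = re.sub(r"<[^>]+>", "", record["raw"]).strip()
        # Returning None signals that the record has no usable text.
        return {"url": record["url"], "text": text} if text else None


def acquire(urls, downloader, iterator, extractor, out_path) -> int:
    """Chain the three steps and write one JSON object per line with a text column."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for url in urls:
            raw = downloader.download(url)
            for record in iterator.iterate(raw, url):
                doc = extractor.extract(record)
                if doc is not None:
                    fh.write(json.dumps(doc) + "\n")
                    written += 1
    return written


out_file = Path(tempfile.mkdtemp()) / "toy_source.jsonl"
written = acquire(list(FAKE_WEB), InMemoryDownloader(), BlankLineIterator(),
                  TagStrippingExtractor(), out_file)
texts = [json.loads(line)["text"]
         for line in out_file.read_text(encoding="utf-8").splitlines()]
print(written, texts)
```

Here `list(FAKE_WEB)` plays the role of URL generation. The record that contains only `<br>` yields no text and is skipped by the extractor.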
Step 2: Content Processing and Cleaning
Apply text modifiers to normalize and clean the raw text content. This includes fixing Unicode encoding errors (mojibake), normalizing excessive newlines, removing Markdown formatting artifacts, stripping URLs, removing wrapping quotation marks, and applying C4-style cleaning rules. Multiple modifiers can be chained together in sequence through the Modify stage, which applies one or more DocumentModifier implementations to each document.
Key considerations:
- Unicode reformatting uses the ftfy library for encoding error detection and repair
- C4 modifier applies cleaning rules from the C4 dataset methodology
- Line remover filters out lines matching a regex pattern
- Modifiers are applied in the order they are added to the pipeline
- Each modifier operates on the document's text field and returns the cleaned text
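The chaining behaviour described above can be illustrated with plain functions. This is a sketch under stated assumptions: the three modifier functions and `apply_modifiers` are hypothetical helpers written for this example using only the standard library. They do not reproduce NeMo Curator's `DocumentModifier` implementations, and the ftfy-based Unicode repair is omitted.

```python
import re
from typing import Callable, Iterable

Modifier = Callable[[str], str]


def unwrap_quotes(text: str) -> str:
    """Remove one pair of matching quotation marks wrapping the whole document."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text


def strip_urls(text: str) -> str:
    """Drop http(s) URLs together with one trailing space or tab, if present."""
    return re.sub(r"https?://\S+[ \t]?", "", text)


def collapse_newlines(text: str) -> str:
    """Reduce runs of three or more newlines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def apply_modifiers(text: str, modifiers: Iterable[Modifier]) -> str:
    """Apply modifiers in the order given; each receives the previous one's output."""
    for modify in modifiers:
        text = modify(text)
    return text


raw = '"See https://example.com for details.\n\n\n\nThanks!"'
cleaned = apply_modifiers(raw, [unwrap_quotes, strip_urls, collapse_newlines])
print(repr(cleaned))
```

Order matters here: if `strip_urls` ran on text whose closing quote directly followed a URL, the `\S+` match would consume the quote too, and a later `unwrap_quotes` would no longer find a matching pair.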
Step 3: Quality Assessment and Filtering
Score documents using heuristic filters, FastText classifiers, and GPU-accelerated deep learning classifiers, then remove documents that fall below quality thresholds. Heuristic filters check properties like word count, mean word length, symbol-to-word ratio, bullet-to-line ratio, and repeated content. Classifier-based filters use trained models including domain classifiers, quality classifiers, FineWeb-Edu educational quality scorers, content type classifiers, and AEGIS safety classifiers. The ScoreFilter module combines scoring and filtering into efficient pipeline stages.
Key considerations:
- Heuristic filters operate on CPU and check 25+ document properties
- FastText-based language identification and quality filters require model files
- GPU classifiers (domain, quality, FineWeb-Edu, AEGIS) use HuggingFace transformer models
- Scoring and filtering can be decoupled: compute and store scores first, then filter on thresholds in a later stage
- Code-specific filters check for syntax validity, comment ratios, and language identification
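The decoupled score-then-filter pattern can be sketched with three of the heuristics named above. The scoring functions, the symbol list, and every threshold value below are illustrative assumptions and not NeMo Curator defaults. The library's own filters and the `ScoreFilter` stage are not used here.

```python
def word_count(text: str) -> int:
    return len(text.split())


def mean_word_length(text: str) -> float:
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def symbol_to_word_ratio(text: str, symbols=("#", "...", "…")) -> float:
    words = text.split()
    if not words:
        return 0.0
    return sum(text.count(s) for s in symbols) / len(words)


SCORERS = {
    "word_count": word_count,
    "mean_word_length": mean_word_length,
    "symbol_to_word_ratio": symbol_to_word_ratio,
}

# Hypothetical (min, max) bounds chosen only to make the example readable.
THRESHOLDS = {
    "word_count": (5, 100_000),
    "mean_word_length": (3.0, 10.0),
    "symbol_to_word_ratio": (0.0, 0.1),
}


def score(docs):
    """Scoring pass: attach one column per heuristic without dropping anything."""
    return [{**d, **{name: fn(d["text"]) for name, fn in SCORERS.items()}}
            for d in docs]


def passes(doc) -> bool:
    """Filtering pass: keep a document only if every score is within bounds."""
    return all(lo <= doc[name] <= hi for name, (lo, hi) in THRESHOLDS.items())


docs = [
    {"id": "a", "text": "The quick brown fox jumps over the lazy dog near the river bank."},
    {"id": "b", "text": "too short"},
    {"id": "c", "text": "# # # buy now ... # # click here ... # #"},
]

scored = score(docs)
kept_ids = [d["id"] for d in scored if passes(d)]
print(kept_ids)
```

Keeping the scores as columns makes it possible to inspect their distributions and tune thresholds before any document is discarded.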
Step 4: Deduplication
Remove duplicate and near-duplicate documents using one or more deduplication strategies. NeMo Curator supports three approaches: exact deduplication (hash-based identification using connected components), fuzzy deduplication (MinHash + LSH + connected components for near-duplicate detection), and semantic deduplication (embedding + KMeans clustering + pairwise similarity). Each deduplication method is implemented as a WorkflowBase subclass with its own multi-stage pipeline. For text pipelines, the typical approach is exact deduplication followed by fuzzy deduplication.
Key considerations:
- Exact deduplication hashes the text column and finds identical documents via connected components
- Fuzzy deduplication uses character n-gram MinHash signatures, with a configurable number of hash functions and LSH bands
- Semantic deduplication requires pre-computed embeddings and GPU resources for KMeans and pairwise similarity
- Each method produces a set of document IDs to remove
- Removal is performed by the TextDuplicatesRemovalWorkflow, which reads the original documents, drops those whose IDs are in the removal set, and writes the deduplicated output
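The fuzzy path (character n-gram MinHash, LSH banding, connected components, then removal by ID) can be sketched on a handful of documents with the standard library. This is a single-process toy, not the GPU-accelerated NeMo Curator workflow: the function names, the choice of `blake2b` as the hash family, and the parameter values (32 hashes, 8 bands, 5-character n-grams) are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, n: int = 5) -> set:
    """Character n-grams over lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text: str, num_hashes: int = 32, n: int = 5) -> list:
    """One minimum per salted hash function; similar sets share many minima."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for g in grams))
    return signature


def find(parent: dict, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def fuzzy_duplicate_ids(docs: dict, num_hashes: int = 32, bands: int = 8) -> set:
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text, num_hashes)
        # LSH: documents that agree on every row of any one band share a bucket.
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    # Connected components: merge all documents that share at least one bucket.
    parent = {doc_id: doc_id for doc_id in docs}
    for members in buckets.values():
        for other in members[1:]:
            ra, rb = find(parent, members[0]), find(parent, other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # smallest ID becomes the root
    # Keep each component's root; every other member goes into the removal set.
    return {doc_id for doc_id in docs if find(parent, doc_id) != doc_id}


docs = {
    "a": "The cat sat on the mat.",
    "b": "the  cat sat on the MAT.",          # identical after normalization
    "c": "Completely unrelated sentence about GPUs",
    "d": "The cat sat on the mat today.",    # near-duplicate of "a"
}
to_remove = fuzzy_duplicate_ids(docs)
deduplicated = {i: t for i, t in docs.items() if i not in to_remove}
print(sorted(to_remove), sorted(deduplicated))
```

Document "b" always lands in the same buckets as "a" because their n-gram sets are identical. Document "d" is only likely, not guaranteed, to be caught: LSH is probabilistic, and the band and hash counts trade recall against false positives.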
Step 5: Export
Write the curated text dataset to the target output format. NeMo Curator supports JSONL, Parquet, and Megatron tokenized binary output formats. The ParquetWriter, JsonlWriter, and MegatronTokenizerWriter stages handle serialization, partitioning, and optional compression. For Megatron training, the tokenizer writer converts text to token IDs using a specified tokenizer and writes binary files compatible with Megatron-LM's data loading.
Key considerations:
- Parquet output supports configurable row group sizes and compression codecs
- JSONL output writes one JSON object per line with configurable fields
- The Megatron tokenizer writer requires a tokenizer model file and produces binary output files
- Output can be written to the local filesystem or to cloud storage via fsspec
- File partitioning and naming can be configured for downstream training workflows
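Partitioned JSONL export with field selection and optional compression can be sketched with the standard library. The function `write_jsonl_partitions`, its parameters, and the `part-00000` naming scheme are hypothetical choices for this example. They are not the `JsonlWriter` stage's actual options, and Parquet and Megatron binary output are not covered.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_partitions(docs, out_dir, docs_per_file=2,
                           fields=("id", "text"), compress=True):
    """Write docs as one JSON object per line, split across numbered files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = ".jsonl.gz" if compress else ".jsonl"
    opener = gzip.open if compress else open
    paths = []
    for part, start in enumerate(range(0, len(docs), docs_per_file)):
        path = out_dir / f"part-{part:05d}{suffix}"
        with opener(path, "wt", encoding="utf-8") as fh:
            for doc in docs[start:start + docs_per_file]:
                # Only the selected fields are exported; score columns are dropped.
                row = {key: doc[key] for key in fields}
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


docs = [{"id": i, "text": f"document {i}", "quality_score": 0.5} for i in range(5)]
paths = write_jsonl_partitions(docs, tempfile.mkdtemp())

# Read the first partition back to confirm the layout.
with gzip.open(paths[0], "rt", encoding="utf-8") as fh:
    first_rows = [json.loads(line) for line in fh]
print([p.name for p in paths], first_rows)
```

Fixed-size partitions with zero-padded names sort lexicographically in write order, which keeps downstream sharding across training workers predictable.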