Workflow:NVIDIA NeMo Curator Text Curation Pipeline

From Leeroopedia
Domains: Data_Engineering, NLP, LLM_Training
Last Updated: 2026-02-14 17:00 GMT

Overview

End-to-end process for curating high-quality text datasets from web-scale sources for large language model training using NeMo Curator's stage-based pipeline architecture.

Description

This workflow outlines the standard procedure for preparing text data for LLM pretraining and fine-tuning. It covers the full lifecycle from acquiring raw text data (Common Crawl, Wikipedia, arXiv, or custom sources) through cleaning, filtering, quality assessment, deduplication, and final export. The pipeline leverages GPU-accelerated processing via RAPIDS cuDF/cuML and distributed execution on Ray clusters. Each stage in the pipeline is a composable ProcessingStage that can be configured independently and executed through the Pipeline API or YAML-driven configuration.
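
The idea of composable stages can be illustrated with a minimal sketch in plain Python. It does not use NeMo Curator's actual ProcessingStage or Pipeline classes; the names SketchStage, SketchPipeline, add_stage, and run are hypothetical and only show how independently configured stages might be chained over a batch of documents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Document = Dict[str, str]  # assumption: a document is a dict with a "text" key


@dataclass
class SketchStage:
    """Hypothetical stand-in for a stage: a named function over a batch."""
    name: str
    fn: Callable[[List[Document]], List[Document]]

    def process(self, batch: List[Document]) -> List[Document]:
        return self.fn(batch)


@dataclass
class SketchPipeline:
    """Hypothetical pipeline that runs its stages in insertion order."""
    stages: List[SketchStage] = field(default_factory=list)

    def add_stage(self, stage: SketchStage) -> "SketchPipeline":
        self.stages.append(stage)
        return self  # returning self allows chained add_stage calls

    def run(self, batch: List[Document]) -> List[Document]:
        for stage in self.stages:
            batch = stage.process(batch)
        return batch


# Two independent stages: one rewrites text, one drops empty documents.
strip_stage = SketchStage(
    "strip", lambda docs: [{**d, "text": d["text"].strip()} for d in docs]
)
drop_empty_stage = SketchStage(
    "drop_empty", lambda docs: [d for d in docs if d["text"]]
)

pipeline = SketchPipeline().add_stage(strip_stage).add_stage(drop_empty_stage)
result = pipeline.run([{"text": "  hello  "}, {"text": "   "}])
```

Because each stage only sees and returns a batch of documents, stages can be reordered, removed, or swapped without touching the others, which is the property the stage-based architecture relies on.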

Usage

Execute this workflow when you have raw text data from web crawls, academic archives, or custom sources and need to produce a clean, deduplicated, high-quality text corpus for LLM training. The pipeline is designed for datasets ranging from gigabytes to petabytes and supports both CPU and GPU processing modes.

Execution Steps

Step 1: Data Acquisition

Acquire raw text data from one or more supported sources. NeMo Curator provides built-in download and extraction stages for Common Crawl (WARC files), Wikipedia (dump files), and arXiv (LaTeX source archives). For custom data sources, implement the DocumentDownloader, DocumentIterator, and DocumentExtractor abstract classes to create a custom download pipeline. Each source uses a composite stage that chains URL generation, downloading, iteration, and extraction into a single pipeline unit.

Key considerations:

  • Common Crawl requires generating URLs from crawl indices and downloading WARC files
  • Wikipedia processing extracts clean text from MediaWiki markup, removing tables, infoboxes, and references
  • arXiv extraction converts LaTeX source to plain text
  • Custom sources must implement the base abstract classes for URL generation, downloading, and extraction
  • Output is written as JSONL or Parquet files with a text column
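
The custom-source pattern described above can be sketched as three small abstract classes chained by one driver function. This is an illustration only: the class names SketchDownloader, SketchIterator, and SketchExtractor, their method names, and the run_source helper are assumptions, not the signatures of NeMo Curator's DocumentDownloader, DocumentIterator, and DocumentExtractor. The "download" reads a local file so the example runs without network access.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator, List, Optional


class SketchDownloader(ABC):
    @abstractmethod
    def download(self, url: str) -> Path: ...


class SketchIterator(ABC):
    @abstractmethod
    def iterate(self, path: Path) -> Iterator[dict]: ...


class SketchExtractor(ABC):
    @abstractmethod
    def extract(self, record: dict) -> Optional[dict]: ...


class LocalFileDownloader(SketchDownloader):
    """Treats the 'URL' as a local path, so nothing is fetched remotely."""
    def download(self, url: str) -> Path:
        return Path(url)


class BlankLineIterator(SketchIterator):
    """Yields one raw record per blank-line-separated block of the file."""
    def iterate(self, path: Path) -> Iterator[dict]:
        blocks = path.read_text(encoding="utf-8").split("\n\n")
        for i, block in enumerate(blocks):
            yield {"id": f"{path.name}-{i}", "raw": block}


class StripExtractor(SketchExtractor):
    """Returns a document with a text field, or None for empty records."""
    def extract(self, record: dict) -> Optional[dict]:
        text = record["raw"].strip()
        return {"id": record["id"], "text": text} if text else None


def run_source(urls: List[str], downloader, iterator, extractor, out_path: Path) -> int:
    """Chain download -> iterate -> extract and write JSONL; return doc count."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            local_path = downloader.download(url)
            for record in iterator.iterate(local_path):
                doc = extractor.extract(record)
                if doc is not None:
                    out.write(json.dumps(doc) + "\n")
                    written += 1
    return written


workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw.txt"
raw_file.write_text("first doc\n\n\n\nsecond doc\n", encoding="utf-8")
out_file = workdir / "docs.jsonl"
count = run_source(
    [str(raw_file)], LocalFileDownloader(), BlankLineIterator(), StripExtractor(), out_file
)
```

Keeping the three roles separate means a new source usually only needs a new iterator and extractor, while the download logic can often be reused.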

Step 2: Content Processing and Cleaning

Apply text modifiers to normalize and clean the raw text content. This includes fixing Unicode encoding errors (mojibake), normalizing excessive newlines, removing Markdown formatting artifacts, stripping URLs, removing wrapping quotation marks, and applying C4-style cleaning rules. Multiple modifiers can be chained together in sequence through the Modify stage, which applies one or more DocumentModifier implementations to each document.

Key considerations:

  • Unicode reformatting uses the ftfy library for encoding error detection and repair
  • C4 modifier applies cleaning rules from the C4 dataset methodology
  • Line remover filters out lines matching a regex pattern
  • Modifiers are applied in the order they are added to the pipeline
  • Each modifier operates on the document's text field and returns the cleaned text
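
A minimal sketch of modifier chaining, assuming each modifier is simply a function from text to text. The function names and the apply_modifiers helper are hypothetical and do not reproduce the Modify stage or the DocumentModifier interface; the regexes are deliberately simple. The example also shows why the order of modifiers matters.

```python
import re
from typing import Callable, Dict, List

Modifier = Callable[[str], str]  # assumption: a modifier maps text to text


def normalize_newlines(text: str) -> str:
    """Collapse runs of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)


def strip_urls(text: str) -> str:
    """Remove http(s) URLs (simplified pattern, not a full URL grammar)."""
    return re.sub(r"https?://\S+", "", text)


def remove_wrapping_quotes(text: str) -> str:
    """Drop one pair of matching quotes that wraps the whole document."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text


def apply_modifiers(doc: Dict[str, str], modifiers: List[Modifier]) -> Dict[str, str]:
    """Apply modifiers to the text field in the order given."""
    text = doc["text"]
    for modifier in modifiers:
        text = modifier(text)
    return {**doc, "text": text}


doc = {"id": "a", "text": '"Intro line\n\n\n\nhttps://example.com\nThanks"'}

# Stripping the URL first leaves a long newline run that is then collapsed.
cleaned = apply_modifiers(doc, [strip_urls, normalize_newlines, remove_wrapping_quotes])

# Normalizing first leaves three newlines behind once the URL is removed.
reordered = apply_modifiers(doc, [normalize_newlines, strip_urls, remove_wrapping_quotes])
```

Here cleaned["text"] is "Intro line\n\nThanks", while reordered["text"] keeps an extra newline, so normalization modifiers are usually best placed after modifiers that delete content.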

Step 3: Quality Assessment and Filtering

Score documents using heuristic filters, FastText classifiers, and GPU-accelerated deep learning classifiers, then remove documents that fall below quality thresholds. Heuristic filters check properties like word count, mean word length, symbol-to-word ratio, bullet-to-line ratio, and repeated content. Classifier-based filters use trained models including domain classifiers, quality classifiers, FineWeb-Edu educational quality scorers, content type classifiers, and AEGIS safety classifiers. The ScoreFilter module combines scoring and filtering into efficient pipeline stages.

Key considerations:

  • Heuristic filters operate on CPU and check 25+ document properties
  • FastText-based language identification and quality filters require model files
  • GPU classifiers (domain, quality, FineWeb-Edu, AEGIS) use HuggingFace transformer models
  • Score and filter can be decoupled: score first, then filter based on thresholds
  • Code-specific filters check for syntax validity, comment ratios, and language identification
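
The decoupled score-then-filter pattern can be sketched with three simple heuristics. The scorer definitions and the threshold values below are illustrative assumptions, not NeMo Curator's filter implementations or defaults.

```python
from typing import Callable, Dict, List, Tuple


def word_count(text: str) -> float:
    return float(len(text.split()))


def mean_word_length(text: str) -> float:
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def symbol_to_word_ratio(text: str) -> float:
    """Count '#' characters and '...' ellipses relative to the word count."""
    words = text.split()
    symbols = text.count("#") + text.count("...")
    return symbols / len(words) if words else 0.0


SCORERS: Dict[str, Callable[[str], float]] = {
    "word_count": word_count,
    "mean_word_length": mean_word_length,
    "symbol_to_word_ratio": symbol_to_word_ratio,
}

# Hypothetical (min, max) bounds per score; tune these for a real corpus.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "word_count": (5, 100_000),
    "mean_word_length": (3.0, 10.0),
    "symbol_to_word_ratio": (0.0, 0.1),
}


def score(doc: dict) -> dict:
    """Scoring pass: attach every heuristic score without dropping anything."""
    return {**doc, "scores": {name: fn(doc["text"]) for name, fn in SCORERS.items()}}


def passes(scored: dict) -> bool:
    """Filtering pass: keep the document only if every score is in bounds."""
    return all(lo <= scored["scores"][name] <= hi for name, (lo, hi) in THRESHOLDS.items())


docs: List[dict] = [
    {"id": "good", "text": "Distributed pipelines process large text corpora efficiently."},
    {"id": "short", "text": "Too short"},
    {"id": "noisy", "text": "### ### ### buy now ... click here ... ### ###"},
]

scored_docs = [score(d) for d in docs]           # step 1: score everything
kept_ids = [d["id"] for d in scored_docs if passes(d)]  # step 2: apply thresholds
```

Storing the scores before filtering makes it possible to inspect score distributions and adjust thresholds without recomputing the heuristics.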

Step 4: Deduplication

Remove duplicate and near-duplicate documents using one or more deduplication strategies. NeMo Curator supports three approaches: exact deduplication (hash-based identification using connected components), fuzzy deduplication (MinHash + LSH + connected components for near-duplicate detection), and semantic deduplication (embedding + KMeans clustering + pairwise similarity). Each deduplication method is implemented as a WorkflowBase subclass with its own multi-stage pipeline. For text pipelines, the typical approach is exact deduplication followed by fuzzy deduplication.

Key considerations:

  • Exact deduplication hashes the text column and finds identical documents via connected components
  • Fuzzy deduplication uses character n-gram MinHash signatures; the number of hash functions and the number of LSH bands are configurable
  • Semantic deduplication requires pre-computed embeddings and GPU resources for KMeans and pairwise similarity
  • Each method produces a set of document IDs to remove
  • Removal is performed by the TextDuplicatesRemovalWorkflow which reads originals, filters by removal IDs, and writes deduplicated output
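
The flow of "identify removal IDs, then remove" can be sketched on a toy scale using only the standard library. This is not the WorkflowBase or TextDuplicatesRemovalWorkflow implementation: the function names, the choice of SHA-256 and BLAKE2b, and the parameters (32 hashes, 8 bands, 5-character n-grams) are assumptions for illustration, and the union-find loop stands in for a distributed connected-components step.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set


def exact_removal_ids(docs: List[dict]) -> Set[str]:
    """Group documents by a hash of their text; keep the first of each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for d in docs:
        digest = hashlib.sha256(d["text"].encode("utf-8")).hexdigest()
        groups[digest].append(d["id"])
    return {doc_id for ids in groups.values() for doc_id in ids[1:]}


def char_ngrams(text: str, n: int = 5) -> Set[str]:
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text: str, num_hashes: int = 32, n: int = 5) -> List[int]:
    """One minimum per salted hash function over the character n-grams."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def fuzzy_removal_ids(docs: List[dict], num_hashes: int = 32, bands: int = 8) -> Set[str]:
    """LSH banding proposes pairs; union-find merges them into components."""
    rows = num_hashes // bands
    parent = {d["id"]: d["id"] for d in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    buckets: Dict[tuple, List[str]] = defaultdict(list)
    for d in docs:
        sig = minhash_signature(d["text"], num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(d["id"])

    for ids in buckets.values():
        for other in ids[1:]:
            root_a, root_b = find(ids[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    components: Dict[str, List[str]] = defaultdict(list)
    for d in docs:  # input order decides which member of a component is kept
        components[find(d["id"])].append(d["id"])
    return {doc_id for ids in components.values() for doc_id in ids[1:]}


def remove(docs: List[dict], removal_ids: Set[str]) -> List[dict]:
    return [d for d in docs if d["id"] not in removal_ids]


docs = [
    {"id": "a", "text": "The quick brown fox jumps over the lazy dog."},
    {"id": "b", "text": "The quick brown fox jumps over the lazy dog."},
    {"id": "c", "text": "A completely different sentence about GPUs."},
]
exact_ids = exact_removal_ids(docs)
after_exact = remove(docs, exact_ids)
fuzzy_ids = fuzzy_removal_ids(docs)
```

Running exact deduplication first is cheap and shrinks the input to the more expensive fuzzy pass. In the fuzzy pass, more bands with fewer rows each make candidate matches more likely, which raises recall at the cost of more false positives.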

Step 5: Export

Write the curated text dataset to the target output format. NeMo Curator supports JSONL, Parquet, and Megatron tokenized binary output formats. The ParquetWriter, JsonlWriter, and MegatronTokenizerWriter stages handle serialization, partitioning, and optional compression. For Megatron training, the tokenizer writer converts text to token IDs using a specified tokenizer and writes binary files compatible with Megatron-LM's data loading.

Key considerations:

  • Parquet output supports configurable row group sizes and compression codecs
  • JSONL output writes one JSON object per line with configurable fields
  • Megatron tokenizer writer requires a tokenizer model file and produces binary output files
  • Output can be written to local filesystem or cloud storage via fsspec
  • File partitioning and naming can be configured for downstream training workflows
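
As a rough illustration of partitioned JSONL export with field selection and optional compression, the sketch below uses only the standard library. It is not the JsonlWriter stage: the write_jsonl_partitions function, its parameters, and the part-NNNNN file naming are assumptions made for this example.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import List, Sequence


def write_jsonl_partitions(
    docs: List[dict],
    out_dir: Path,
    docs_per_file: int = 2,
    fields: Sequence[str] = ("id", "text"),
    compress: bool = False,
) -> List[Path]:
    """Write docs as one JSON object per line, split into fixed-size files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = ".jsonl.gz" if compress else ".jsonl"
    opener = gzip.open if compress else open
    paths = []
    for part, start in enumerate(range(0, len(docs), docs_per_file)):
        path = out_dir / f"part-{part:05d}{suffix}"
        with opener(path, "wt", encoding="utf-8") as f:
            for d in docs[start:start + docs_per_file]:
                # Only the configured fields are serialized.
                f.write(json.dumps({k: d[k] for k in fields}, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


docs = [
    {"id": "a", "text": "first", "scores": {"word_count": 1}},
    {"id": "b", "text": "second", "scores": {"word_count": 1}},
    {"id": "c", "text": "third", "scores": {"word_count": 1}},
]
out_dir = Path(tempfile.mkdtemp()) / "curated"
paths = write_jsonl_partitions(docs, out_dir, docs_per_file=2, compress=True)

# Read the first partition back to confirm the round trip.
with gzip.open(paths[0], "rt", encoding="utf-8") as f:
    first_partition = [json.loads(line) for line in f]
```

Zero-padded partition names sort lexicographically in the same order they were written, which keeps downstream file listing deterministic.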
