
Principle:NVIDIA NeMo Curator Text Deduplication

From Leeroopedia
Knowledge Sources
Domains Data_Curation, NLP, Deduplication
Last Updated 2026-02-14 17:00 GMT

Overview

Technique for identifying and removing duplicate or near-duplicate text documents from large-scale corpora to improve language model training data quality.

Description

Text Deduplication addresses the well-documented problem that duplicate content in training data leads to memorization, reduced generalization, and wasted compute. NeMo Curator provides three deduplication strategies: Exact Deduplication (hash-based identification), Fuzzy Deduplication (MinHash + LSH for near-duplicate detection), and Text Duplicates Removal (removing identified duplicates from the dataset). These can be composed as workflow objects that orchestrate multi-stage GPU-accelerated pipelines.

Usage

Use text deduplication as the final filtering step before data export. Apply exact deduplication first (cheaper, catches verbatim copies), then fuzzy deduplication (catches near-duplicates with minor edits). For large-scale datasets, the GPU-accelerated fuzzy deduplication pipeline with RAPIDS cuDF is recommended.
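The recommended ordering (cheap exact pass first, then fuzzy) can be sketched as a simple staged driver. The function names and stage signatures below are hypothetical stand-ins for illustration, not NeMo Curator's actual workflow API:

```python
def run_dedup(documents, stages):
    """Apply deduplication stages in order, cheapest first.

    Each stage takes and returns a list of (doc_id, text) pairs,
    dropping the duplicates it identifies.
    """
    for name, stage in stages:
        before = len(documents)
        documents = stage(documents)
        print(f"{name}: kept {len(documents)} of {before}")
    return documents

def drop_exact(docs):
    """Hypothetical exact stage: keep the first doc per normalized text."""
    seen, kept = set(), []
    for doc_id, text in docs:
        key = " ".join(text.split())  # collapse whitespace before comparing
        if key not in seen:
            seen.add(key)
            kept.append((doc_id, text))
    return kept

docs = [("a", "hello  world"), ("b", "hello world"), ("c", "other text")]
survivors = run_dedup(docs, [("exact", drop_exact)])
# survivors -> [("a", "hello  world"), ("c", "other text")]
```

A fuzzy stage would slot in as a second `("fuzzy", ...)` entry, running only on the survivors of the exact pass.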

Theoretical Basis

Text deduplication strategies:

Exact Deduplication:

  • Hash each document (e.g., MD5/SHA-256 of normalized text)
  • Documents with identical hashes are exact duplicates
  • Use connected components to group duplicate clusters
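The exact-deduplication steps above can be sketched in plain Python. The normalization shown (lowercase, collapsed whitespace) is illustrative; production pipelines may normalize differently:

```python
import hashlib
from collections import defaultdict

def exact_duplicate_groups(documents):
    """Group document ids whose normalized text hashes to the same digest.

    documents: iterable of (doc_id, text) pairs.
    Returns the clusters with two or more members (exact duplicates).
    """
    groups = defaultdict(list)
    for doc_id, text in documents:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Only digests shared by 2+ documents form duplicate clusters.
    return [ids for ids in groups.values() if len(ids) > 1]

docs = [
    ("a", "The quick brown fox"),
    ("b", "the quick  brown fox"),   # identical after normalization
    ("c", "An unrelated document"),
]
# exact_duplicate_groups(docs) -> [["a", "b"]]
```

Because identical hashes imply identical (normalized) text, no pairwise comparison step is needed; grouping by digest is the whole algorithm.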

Fuzzy Deduplication:

  • Compute MinHash signatures from character n-gram shingles
  • Apply Locality-Sensitive Hashing (LSH) to group similar documents into buckets
  • Convert buckets to edge pairs and find connected components
  • Select one representative per component
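The MinHash/LSH steps above can be illustrated with a minimal pure-Python sketch. It salts Python's built-in `hash` per seed instead of using proper hash families, and is in no way representative of NeMo Curator's GPU implementation; the 260 hashes arranged as 20 bands of 13 rows mirror the parameters used in the pseudo-code below:

```python
import random
from collections import defaultdict

def shingles(text, k=5):
    """Set of character k-gram shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, seeds, k=5):
    """One minimum per seeded hash function over the shingle set."""
    sh = shingles(text, k)
    return [min(hash((seed, s)) for s in sh) for seed in seeds]

def lsh_buckets(signatures, num_bands, rows_per_band):
    """Documents sharing an identical signature band become candidate pairs."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(num_bands):
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[(b, band)].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

random.seed(0)
seeds = [random.randrange(1 << 30) for _ in range(260)]  # 20 bands x 13 rows

text_a = " ".join(f"token{i}" for i in range(100))
text_b = text_a.replace("token50", "tokenXX")  # near-duplicate: tiny edit
text_c = " ".join(f"word{i}" for i in range(100))  # unrelated document

sigs = {name: minhash_signature(t, seeds)
        for name, t in [("a", text_a), ("b", text_b), ("c", text_c)]}
candidates = lsh_buckets(sigs, num_bands=20, rows_per_band=13)
# "a" and "b" share ~98% of shingles, so they land in a common band bucket.
```

The banding parameters set the similarity threshold: with b bands of r rows, documents with Jaccard similarity s collide in at least one band with probability 1 - (1 - s^r)^b, which for b=20, r=13 gives a sharp cutoff around s ≈ 0.8.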

Pseudo-code:

# Abstract deduplication. The helper functions (find_identical_hashes,
# compute_minhash, lsh, etc.) are placeholders for pipeline stages,
# not concrete implementations.
def deduplicate(documents, method="fuzzy"):
    if method == "exact":
        # Identical hashes of document text mark verbatim copies.
        hashes = {doc.id: hash(doc.text) for doc in documents}
        duplicates = find_identical_hashes(hashes)
    elif method == "fuzzy":
        # 260 MinHash values arranged as 20 bands x 13 hashes per band.
        signatures = compute_minhash(documents, num_hashes=260)
        buckets = lsh(signatures, num_bands=20, hashes_per_band=13)
        edges = buckets_to_edges(buckets)
        components = connected_components(edges)
        # Keep one representative per component; mark the rest.
        duplicates = identify_duplicates(components)
    return remove_duplicates(documents, duplicates)
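The connected-components and representative-selection steps in the pseudo-code can be realized with a small union-find structure. This is a generic sketch (path-halving variant); the function names are illustrative, not NeMo Curator's API:

```python
def connected_components(edges):
    """Union-find over candidate-pair edges; returns node -> component root."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return {node: find(node) for node in parent}

def duplicates_to_remove(edges):
    """Keep one representative (the root) per component; mark the rest."""
    roots = connected_components(edges)
    keep = set(roots.values())
    return {node for node in roots if node not in keep}

# Two duplicate clusters: {a, b, c} and {x, y}.
edges = [("a", "b"), ("b", "c"), ("x", "y")]
removed = duplicates_to_remove(edges)
# 5 nodes minus 2 representatives -> 3 documents removed.
```

Documents that appear in no edge are never candidates, so they are trivially kept; only nodes touched by an LSH bucket enter the union-find at all.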

Related Pages

Implemented By
