Principle: NVIDIA NeMo Curator Text Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Technique for identifying and removing duplicate or near-duplicate text documents from large-scale corpora to improve language model training data quality.
Description
Text Deduplication addresses the well-documented problem that duplicate content in training data leads to memorization, reduced generalization, and wasted compute. NeMo Curator provides three deduplication strategies: Exact Deduplication (hash-based identification), Fuzzy Deduplication (MinHash + LSH for near-duplicate detection), and Text Duplicates Removal (removing identified duplicates from the dataset). These can be composed as workflow objects that orchestrate multi-stage GPU-accelerated pipelines.
Usage
Use text deduplication as the final filtering step before data export. Apply exact deduplication first (cheaper, catches verbatim copies), then fuzzy deduplication (catches near-duplicates with minor edits). For large-scale datasets, the GPU-accelerated fuzzy deduplication pipeline with RAPIDS cuDF is recommended.
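The ordering above (exact first, then fuzzy) can be sketched as a two-stage pipeline. This is an illustrative stand-in, not the NeMo Curator API: `exact_stage` and `dedup_pipeline` are hypothetical names, and the fuzzy stage is left as a pluggable hook.

```python
def exact_stage(docs: dict[str, str]) -> dict[str, str]:
    # Keep the first document seen for each normalized text,
    # removing verbatim copies cheaply before the fuzzy pass.
    seen, kept = set(), {}
    for doc_id, text in docs.items():
        key = " ".join(text.lower().split())  # simple normalization
        if key not in seen:
            seen.add(key)
            kept[doc_id] = text
    return kept

def dedup_pipeline(docs: dict[str, str], fuzzy_stage=lambda d: d) -> dict[str, str]:
    # Exact dedup first (cheap), then the more expensive fuzzy stage.
    return fuzzy_stage(exact_stage(docs))
```

Running the exact stage before fuzzy deduplication shrinks the candidate set that the MinHash/LSH machinery has to process.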
Theoretical Basis
Text deduplication strategies:
Exact Deduplication:
- Hash each document (e.g., MD5/SHA-256 of normalized text)
- Documents with identical hashes are exact duplicates
- Use connected components to group duplicate clusters
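The exact-duplicate steps above reduce to a single hash-and-group pass. A minimal self-contained sketch using the standard library (the function names and the `id -> text` schema are illustrative, not NeMo Curator's):

```python
import hashlib
from collections import defaultdict

def normalize(text: str) -> str:
    # Simple normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_duplicate_clusters(documents: dict[str, str]) -> list[set[str]]:
    # Hash each normalized document; identical digests are exact duplicates.
    buckets: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        buckets[digest].add(doc_id)
    # Buckets with more than one member are duplicate clusters;
    # a full pipeline would keep one representative per cluster.
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Because identical hashes already partition the corpus, no explicit graph step is needed for the exact case; connected components only become necessary when fuzzy matching produces pairwise edges.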
Fuzzy Deduplication:
- Compute MinHash signatures from character n-gram shingles
- Apply Locality-Sensitive Hashing (LSH) to group similar documents into buckets
- Convert buckets to edge pairs and find connected components
- Select one representative per component
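The fuzzy steps above can be sketched end to end in plain Python. This is a toy implementation under assumed parameters (character 5-grams, 260 hashes as 20 bands × 13 rows, matching the pseudo-code below); production pipelines use vectorized GPU kernels rather than per-shingle hashing:

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text: str, k: int = 5) -> set[str]:
    # Character k-gram shingles of the document.
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(sh: set[str], num_hashes: int = 260) -> list[int]:
    # Simulate independent hash functions by salting one stable hash.
    return [
        min(int.from_bytes(
            hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in sh)
        for seed in range(num_hashes)
    ]

def lsh_edges(signatures: dict[str, list[int]], bands: int = 20, rows: int = 13) -> set:
    # Band each signature; documents sharing any band become candidate pairs.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    edges = set()
    for ids in buckets.values():
        edges.update(combinations(sorted(ids), 2))
    return edges

def connected_components(edges) -> list[set]:
    # Union-find over candidate pairs to form duplicate clusters.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = defaultdict(set)
    for x in list(parent):
        comps[find(x)].add(x)
    return list(comps.values())
```

Selecting one representative per component (e.g., the lowest document id) then yields the keep-set; everything else is flagged as a duplicate.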
Pseudo-code:
# Abstract deduplication workflow
def deduplicate(documents, method="fuzzy"):
    if method == "exact":
        # Identical hashes of (normalized) text mean exact duplicates.
        hashes = {doc.id: hash(doc.text) for doc in documents}
        duplicates = find_identical_hashes(hashes)
    elif method == "fuzzy":
        # 260 MinHash values, split into 20 bands of 13 hashes each.
        signatures = compute_minhash(documents, num_hashes=260)
        buckets = lsh(signatures, num_bands=20, hashes_per_band=13)
        edges = buckets_to_edges(buckets)
        components = connected_components(edges)
        duplicates = identify_duplicates(components)
    return remove_duplicates(documents, duplicates)
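The band/row split determines which similarity level LSH targets. With b bands of r rows, a pair with Jaccard similarity s collides in at least one band with probability 1 - (1 - s^r)^b, and the S-curve's threshold is approximately (1/b)^(1/r); for the b=20, r=13 used above that is about 0.79. A small sketch of the arithmetic (function names are illustrative):

```python
def lsh_collision_probability(s: float, bands: int = 20, rows: int = 13) -> float:
    # Probability that two docs with Jaccard similarity s share >= 1 LSH band.
    return 1.0 - (1.0 - s ** rows) ** bands

def lsh_threshold(bands: int = 20, rows: int = 13) -> float:
    # Approximate similarity at the steepest point of the S-curve.
    return (1.0 / bands) ** (1.0 / rows)
```

Pairs well above the threshold are almost certainly bucketed together, while pairs well below it are almost never candidates, which is what makes the banded scheme efficient.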