
Workflow:ChenghaoMou Text dedup MinHash LSH Deduplication

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP, Deduplication
Last Updated: 2026-02-14 21:00 GMT

Overview

End-to-end process for near-duplicate text detection using MinHash fingerprinting with Locality-Sensitive Hashing (LSH) banding, Polars-based clustering, and optional false positive verification.

Description

This workflow implements the most commonly used and highest-accuracy deduplication algorithm in the text-dedup library. It detects near-duplicate documents by computing MinHash signatures (compact fingerprints that approximate Jaccard similarity), then groups similar documents using LSH banding to efficiently find candidate pairs without exhaustive pairwise comparison. Candidate clusters are formed using Polars DataFrames and the polars-grouper library for connected-component merging. An optional verification step removes false positives by computing actual Jaccard similarity on n-gram sets. The pipeline is config-driven via TOML files and operates on HuggingFace Datasets.

Goal: A deduplicated dataset with near-duplicate documents removed, plus optional cluster assignment metadata.

Scope: From raw text data (local files or HuggingFace datasets) through fingerprinting, clustering, verification, and filtered output.

Strategy: Uses MinHash + LSH for sub-quadratic candidate pair generation, Polars for efficient large-scale clustering, and n-gram Jaccard verification for precision.

Usage

Execute this workflow when you have a large text corpus (e.g., web crawl data, training datasets, academic papers) and need to remove near-duplicate documents to improve downstream model training quality or reduce dataset redundancy. The MinHash approach is recommended when approximate matching is acceptable and the dataset is too large for exact pairwise comparison. Configure the similarity threshold, number of permutations, and n-gram size via the TOML config file.

Execution Steps

Step 1: Configuration Loading

Parse the TOML configuration file into a typed Config object using pydantic-settings. The configuration specifies input data source (local files or HuggingFace dataset), algorithm parameters (number of permutations, similarity threshold, n-gram size, hash bits, false positive/negative weights), and output settings (directory, cluster saving, cache cleanup). The MinHash-specific config automatically computes optimal LSH band and row parameters from the threshold and permutation count.

Key considerations:

  • The number of permutations and threshold jointly determine band/row parameters for LSH
  • False positive and false negative weights control the LSH parameter optimization tradeoff
  • Hash bits can be 32 or 64, affecting memory usage and collision probability

Step 2: Data Loading and Preprocessing

Load the dataset from local files (parquet, csv, json) or a HuggingFace dataset path using the unified data I/O layer. Each document receives an internal index column for tracking through the pipeline. Short or empty documents are filtered out based on the algorithm's filtering function before fingerprinting begins.

Key considerations:

  • The loader supports both local file formats and HuggingFace dataset identifiers
  • An internal index column is added to every record for cluster tracking
  • Empty or below-threshold documents are pre-filtered to avoid wasted computation
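A minimal stand-in for this stage is shown below. The real workflow operates on HuggingFace Datasets; here plain dicts represent rows, the minimum-length rule is an assumed placeholder for the algorithm's filtering function, and the `__index__` column name is illustrative.

```python
MIN_CHARS = 10  # assumption: the real cutoff comes from the algorithm's filter

def load_and_prefilter(texts):
    """Drop empty/short docs, then attach an internal index column that
    tracks each surviving document through clustering."""
    records = []
    for text in texts:
        if len(text.strip()) < MIN_CHARS:
            continue  # skip before fingerprinting to avoid wasted work
        records.append({"__index__": len(records), "text": text})
    return records

docs = load_and_prefilter(["", "hi", "a reasonably long document", "another long document"])
```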

Step 3: MinHash Fingerprinting

Compute MinHash signatures for each document using parallel map operations on the HuggingFace Dataset. Each document is tokenized into n-grams, then multiple hash functions (determined by num_perm) produce a compact signature. The signatures are split into bands for LSH, producing band index and band value columns that enable efficient grouping.

Key considerations:

  • N-gram tokenization is applied before hashing (configurable n-gram size)
  • The embedding function produces band index/value pairs ready for LSH grouping
  • Processing is parallelized across CPU cores via HuggingFace Dataset.map()

Step 4: LSH Clustering

Group documents by their LSH band index and band value using Polars DataFrames. Documents sharing at least one band bucket are candidate duplicates. Candidate pairs are formed by self-joining within each bucket, then connected components are found using the polars-grouper super_merger function to produce final cluster assignments. Each document is assigned to its cluster representative (minimum index in the cluster).

Key considerations:

  • Polars is used instead of pandas for memory efficiency on large datasets
  • The polars-grouper library efficiently computes connected components
  • Only buckets with more than one document generate candidate pairs
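The bucketing-and-merging logic can be sketched in pure Python as below. The real workflow does the grouping in Polars and the connected-component merge with polars-grouper's super_merger; here a dict of buckets and a union-find stand in for both, keeping the same invariant that each document maps to the minimum index in its cluster.

```python
from collections import defaultdict

def cluster(band_rows):
    """band_rows: iterable of (doc_index, band_index, band_value) tuples.
    Returns {doc_index: cluster_representative (minimum index)}."""
    buckets = defaultdict(list)
    for idx, band, value in band_rows:
        buckets[(band, value)].append(idx)

    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # min index stays the root

    for members in buckets.values():
        if len(members) > 1:  # only multi-document buckets yield candidates
            for other in members[1:]:
                union(members[0], other)

    return {idx: find(idx) for idx, _, _ in band_rows}
```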

Step 5: False Positive Verification

Optionally verify candidate duplicate pairs by computing actual Jaccard similarity on their n-gram sets. For each cluster, all pairs of documents are compared, and only those exceeding the similarity threshold are retained as true duplicates. The cluster assignments are updated to reflect verified relationships, and false positive statistics are logged.

Key considerations:

  • This step is optional (controlled by check_false_positive config flag)
  • Verification uses pairwise Jaccard similarity on n-gram token sets
  • False positive counts and verified cluster statistics are reported
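The verification computation amounts to set-based Jaccard similarity over n-grams, as in this sketch. Character n-grams are assumed here for brevity; the workflow's tokenization is configurable.

```python
def char_ngrams(text, n=5):
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def verify_cluster(docs, threshold=0.7):
    """Compare all pairs in a candidate cluster; keep only those whose
    true Jaccard similarity clears the threshold."""
    kept = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(docs[i], docs[j]) >= threshold:
                kept.append((i, j))
    return kept
```

Pairs rejected here are the false positives that LSH banding admitted; their count is what the pipeline logs as verification statistics.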

Step 6: Duplicate Removal and Output

Filter the dataset to keep only one representative document per cluster (the document whose internal index equals its cluster ID). Save the deduplicated dataset to disk in HuggingFace Dataset format, optionally saving cluster assignments as a pickle file. Clean up cache files if configured.

Key considerations:

  • The skip_filtering option allows saving cluster metadata without removing duplicates
  • Cluster assignments can be saved separately for analysis
  • Cache cleanup is optional to preserve intermediate results
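The final filter reduces to a per-row check against the cluster assignments, sketched below with plain dicts in place of a HuggingFace Dataset. The `__index__`/`__cluster__` column names are illustrative.

```python
def filter_duplicates(records, assignments, skip_filtering=False):
    """Keep one representative per cluster: the document whose internal
    index equals its cluster ID. With skip_filtering=True, keep every
    row and just attach the cluster column (mirroring the workflow's
    skip_filtering option)."""
    out = []
    for rec in records:
        cluster_id = assignments[rec["__index__"]]
        row = {**rec, "__cluster__": cluster_id}
        if skip_filtering or cluster_id == rec["__index__"]:
            out.append(row)
    return out
```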

Execution Diagram

GitHub URL

Workflow Repository