Principle: ChenghaoMou Text Dedup MinHash Fingerprinting
| Knowledge Sources | |
|---|---|
| Domains | Hashing, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A locality-sensitive hashing technique that generates compact fixed-size signatures from document n-grams to estimate Jaccard similarity in sub-linear time.
Description
MinHash Fingerprinting converts each document into a compact signature that preserves set similarity. For each document: (1) tokenize into word n-grams, (2) hash each n-gram with a base hash function, (3) apply num_perm random universal hash permutations of the form h(x) = (a*x + b) mod p mod max_hash, and (4) take the minimum hash value across all n-grams for each permutation. The resulting signature is a vector of num_perm minimum hash values.
The key property is that, for a single random permutation, Pr[MinHash(A) = MinHash(B)] = Jaccard(A, B), so the fraction of matching signature slots is an unbiased estimator of Jaccard similarity. The signature is then split into b bands of r rows each for Locality-Sensitive Hashing (LSH) in the clustering step.
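The description above can be sketched end to end in plain Python. This is a minimal illustration, not any library's implementation; the helper names (base_hash, ngrams, minhash) and the example sentences are assumptions introduced here, while num_perm, max_hash, and the (a*x + b) mod p form follow the text.

```python
import random
import hashlib

PRIME = (1 << 61) - 1     # prime larger than the 32-bit hash space
MAX_HASH = (1 << 32) - 1  # mod max_hash, as in the description

def base_hash(ngram: str) -> int:
    # Base hash: first 4 bytes of SHA-1 as an unsigned integer.
    return int.from_bytes(hashlib.sha1(ngram.encode()).digest()[:4], "big")

def ngrams(text: str, n: int = 3) -> set:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(text: str, perms, n: int = 3) -> list:
    hashes = [base_hash(g) for g in ngrams(text, n)]
    # For each permutation (a, b), keep the minimum permuted hash value.
    return [min(((a * h + b) % PRIME) % MAX_HASH for h in hashes)
            for a, b in perms]

rng = random.Random(42)
num_perm = 128
perms = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
         for _ in range(num_perm)]

a_text = "the quick brown fox jumps over the lazy dog near the river"
b_text = "the quick brown fox jumps over the lazy cat near the river"
sig_a, sig_b = minhash(a_text, perms), minhash(b_text, perms)

# Fraction of matching signature slots estimates Jaccard similarity.
est = sum(x == y for x, y in zip(sig_a, sig_b)) / num_perm
sa, sb = ngrams(a_text), ngrams(b_text)
true_j = len(sa & sb) / len(sa | sb)
print(f"estimated={est:.3f}  true={true_j:.3f}")
```

With 128 permutations the estimator's standard error is roughly sqrt(J(1-J)/128), so the estimate typically lands within a few hundredths of the true Jaccard value.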
Usage
Use this principle when performing near-duplicate detection on large text corpora. MinHash is the preferred fingerprinting method when approximate Jaccard similarity is the desired similarity metric and the corpus is too large for pairwise comparisons.
Theoretical Basis
The MinHash algorithm relies on the following key equations:
For a set S of n-gram hashes and a random hash permutation h, the MinHash value is:

MinHash_h(S) = min_{x in S} h(x)

Universal hashing family:

h_{a,b}(x) = ((a * x + b) mod p) mod max_hash

Where a is drawn uniformly from [1, p) and b from [0, p), with p being a prime larger than the hash space.

The full signature for num_perm permutations:

sig(S) = [MinHash_{h_1}(S), MinHash_{h_2}(S), ..., MinHash_{h_num_perm}(S)]
# Abstract MinHash algorithm (NOT real implementation)
for each document:
    tokens = ngrams(tokenize(text), n)
    hashes = [hash_func(token) for token in tokens]
    for i in range(num_perm):
        signature[i] = min((a[i] * h + b[i]) % prime % max_hash for h in hashes)
    # Split signature into bands for LSH
    bands = [signature[start:end] for (start, end) in band_ranges]
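The banding step at the end is what makes LSH work: two signatures with per-slot agreement s become candidates if they match in at least one full band, which happens with probability 1 - (1 - s^r)^b for b bands of r rows. A small sketch of that S-curve (the b and r values below are hypothetical, chosen to split a 128-slot signature):

```python
# Probability that two signatures with per-slot agreement s collide in
# at least one LSH band, given b bands of r rows each.
def collision_prob(s: float, b: int, r: int) -> float:
    return 1.0 - (1.0 - s ** r) ** b

b, r = 16, 8  # hypothetical split of a 128-permutation signature
for s in (0.3, 0.5, 0.7, 0.9):
    print(f"s={s:.1f} -> P(candidate)={collision_prob(s, b, r):.3f}")
```

The steep transition between low and high similarity is the point: dissimilar pairs almost never become candidates, while near-duplicates almost always do, so the banded buckets replace all-pairs comparison.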