Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:ChenghaoMou Text dedup MinHash Fingerprinting

From Leeroopedia
Knowledge Sources
Domains Hashing, NLP, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A locality-sensitive hashing technique that generates compact fixed-size signatures from document n-grams to estimate Jaccard similarity in sub-linear time.

Description

MinHash Fingerprinting converts each document into a compact signature that preserves set similarity. For each document: (1) tokenize into word n-grams, (2) hash each n-gram with a base hash function, (3) apply num_perm random universal hash permutations of the form h(x) = (a*x + b) mod p mod max_hash, and (4) take the minimum hash value across all n-grams for each permutation. The resulting signature is a vector of num_perm minimum hash values.

The key property is that Pr[MinHash(A) = MinHash(B)] = Jaccard(A, B), making the signature an unbiased estimator of Jaccard similarity. The signature is then split into bands of rows each for Locality-Sensitive Hashing (LSH) in the clustering step.

Usage

Use this principle when performing near-duplicate detection on large text corpora. MinHash is the preferred fingerprinting method when approximate Jaccard similarity is the desired similarity metric and the corpus is too large for pairwise comparisons.

Theoretical Basis

The MinHash algorithm relies on the following key equations:

For a set S of n-gram hashes and a random hash permutation h: MinHashh(S)=minxSh(x)

Universal hashing family: ha,b(x)=((ax+b)modp)modmax_hash

Where a is drawn uniformly from [1, p) and b from [0, p), with p being a prime larger than the hash space.

The full signature for num_perm permutations:

# Abstract MinHash algorithm (NOT real implementation)
for each document:
    tokens = ngrams(tokenize(text), n)
    hashes = [hash_func(token) for token in tokens]
    for i in range(num_perm):
        signature[i] = min((a[i] * h + b[i]) % prime % max_hash for h in hashes)
    # Split signature into bands for LSH
    bands = [signature[start:end] for (start, end) in band_ranges]

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment