# Principle: ChenghaoMou text-dedup SimHash Fingerprinting
| Knowledge Sources | |
|---|---|
| Domains | Hashing, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
A dimensionality reduction technique that maps high-dimensional feature vectors to compact binary fingerprints where Hamming distance approximates cosine similarity.
## Description
SimHash (Similarity-Preserving Hashing) was introduced by Charikar (2002) and applied to web-scale near-duplicate detection by Manku et al. (2007). Unlike MinHash which estimates Jaccard similarity, SimHash generates a single fixed-length binary fingerprint per document by: (1) hashing each n-gram token to a binary vector, (2) weighting each bit position (+1 for 1-bits, -1 for 0-bits), (3) summing the weighted vectors across all tokens, and (4) thresholding the sum to produce the final binary fingerprint.
Two documents are considered near-duplicates if their Hamming distance (the number of differing bits) is at most k. To avoid comparing all pairs, the fingerprint is split into blocks (or passed through bit permutations) so that candidate pairs can be found by exact-match bucketing on the blocks: if two fingerprints differ in at most k bits, at least one block must match exactly when the number of blocks exceeds k.
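The block-based bucketing can be sketched as follows. This is a minimal illustration, assuming 64-bit fingerprints split into 4 blocks; the parameter names and example fingerprints are hypothetical, not the library's:

```python
from collections import defaultdict

F, BLOCKS = 64, 4          # assumed: 64-bit fingerprints, 4 blocks
BLOCK_BITS = F // BLOCKS   # 16 bits per block

def block_keys(fp: int):
    """Yield (block_index, block_value) bucket keys for a fingerprint."""
    for b in range(BLOCKS):
        yield b, (fp >> (b * BLOCK_BITS)) & ((1 << BLOCK_BITS) - 1)

# Two fingerprints differing in a single bit: with k <= 3 differing bits
# and 4 blocks, the pigeonhole principle guarantees at least one shared block.
docs = [(0, 0x123456789ABCDEF0), (1, 0x123456789ABCDEF1)]

buckets = defaultdict(list)
for doc_id, fp in docs:
    for key in block_keys(fp):
        buckets[key].append(doc_id)

candidate_pairs = sorted({tuple(ids) for ids in buckets.values() if len(ids) > 1})
```

Only pairs sharing a bucket are verified with a full Hamming-distance check, which keeps candidate generation near-linear in the number of documents.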
## Usage
Use this principle when near-duplicate detection requires binary fingerprints and Hamming distance comparison. SimHash is faster than MinHash for single-document fingerprinting but may have lower recall for documents with partial overlap.
## Theoretical Basis
The SimHash computation for a document with n-gram set T:
```python
# Sketch of the SimHash algorithm (illustrative, not the library's implementation)
import hashlib

def simhash(ngrams, f=64):
    """Return an f-bit SimHash fingerprint as a list of bits."""
    weights = [0] * f  # one running sum per bit position
    for token in ngrams:
        # f-bit hash of the token (truncated MD5 here is an assumed choice)
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[: f // 8], "big")
        for i in range(f):
            if (h >> i) & 1:
                weights[i] += 1
            else:
                weights[i] -= 1
    # Threshold the accumulated sums to obtain the final binary fingerprint
    return [1 if w > 0 else 0 for w in weights]
```
Hamming distance property:
Two documents are near-duplicate candidates if d_H(u, v) <= k, where u and v are their fingerprints and k is the bit_diff parameter. Manku et al. (2007) used k = 3 with 64-bit fingerprints for web-scale deduplication.
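The candidate test above can be sketched directly; the default of 3 is an assumption matching Manku et al.'s 64-bit setting, and the function names are hypothetical:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance = popcount of the XOR of two fingerprints."""
    return bin(a ^ b).count("1")

def is_candidate(fp1: int, fp2: int, bit_diff: int = 3) -> bool:
    """True if the fingerprints differ in at most bit_diff bits."""
    return hamming(fp1, fp2) <= bit_diff
```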