# Principle: ChenghaoMou text-dedup SimHash Fingerprinting
| Knowledge Sources | |
|---|---|
| Domains | Hashing, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
A dimensionality reduction technique that maps high-dimensional feature vectors to compact binary fingerprints where Hamming distance approximates cosine similarity.
## Description
SimHash (Similarity-Preserving Hashing) was introduced by Charikar (2002) and applied to web-scale near-duplicate detection by Manku et al. (2007). Unlike MinHash which estimates Jaccard similarity, SimHash generates a single fixed-length binary fingerprint per document by: (1) hashing each n-gram token to a binary vector, (2) weighting each bit position (+1 for 1-bits, -1 for 0-bits), (3) summing the weighted vectors across all tokens, and (4) thresholding the sum to produce the final binary fingerprint.
Two documents are considered near-duplicates if their Hamming distance (the number of differing bits) is at most k. To avoid comparing all pairs, the fingerprint is split into blocks (or passed through bit permutations) so that candidate pairs can be found by exact-match bucketing on the blocks: if two fingerprints differ in at most k bits, at least one block must match exactly when the number of blocks exceeds k.
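The block-based bucketing can be sketched as follows. This is a minimal illustration, assuming 64-bit fingerprints split into 4 blocks; the parameter names and example fingerprints are hypothetical, not the library's:

```python
from collections import defaultdict

F, BLOCKS = 64, 4          # assumed: 64-bit fingerprints, 4 blocks
BLOCK_BITS = F // BLOCKS   # 16 bits per block

def block_keys(fp: int):
    """Yield (block_index, block_value) bucket keys for a fingerprint."""
    for b in range(BLOCKS):
        yield b, (fp >> (b * BLOCK_BITS)) & ((1 << BLOCK_BITS) - 1)

# Two fingerprints differing in a single bit: with k <= 3 differing bits
# and 4 blocks, the pigeonhole principle guarantees at least one shared block.
docs = [(0, 0x123456789ABCDEF0), (1, 0x123456789ABCDEF1)]

buckets = defaultdict(list)
for doc_id, fp in docs:
    for key in block_keys(fp):
        buckets[key].append(doc_id)

candidate_pairs = sorted({tuple(ids) for ids in buckets.values() if len(ids) > 1})
```

Only pairs sharing a bucket are verified with a full Hamming-distance check, which keeps candidate generation near-linear in the number of documents.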
## Usage
Use this principle when near-duplicate detection requires binary fingerprints and Hamming distance comparison. SimHash is faster than MinHash for single-document fingerprinting but may have lower recall for documents with partial overlap.
## Theoretical Basis
The SimHash computation for a document with n-gram set T:
```python
# Sketch of the SimHash algorithm (illustrative, not the library's implementation)
import hashlib

def simhash(ngrams, f=64):
    """Return an f-bit SimHash fingerprint as a list of bits."""
    weights = [0] * f  # one running sum per bit position
    for token in ngrams:
        # f-bit hash of the token (truncated MD5 here is an assumed choice)
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[: f // 8], "big")
        for i in range(f):
            if (h >> i) & 1:
                weights[i] += 1
            else:
                weights[i] -= 1
    # Threshold the accumulated sums to obtain the final binary fingerprint
    return [1 if w > 0 else 0 for w in weights]
```
Hamming distance property:
Two documents are near-duplicate candidates if d_H(u, v) <= k, where u and v are their fingerprints and k is the bit_diff parameter. Manku et al. (2007) used k = 3 with 64-bit fingerprints for web-scale deduplication.
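The candidate test above can be sketched directly; the default of 3 is an assumption matching Manku et al.'s 64-bit setting, and the function names are hypothetical:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance = popcount of the XOR of two fingerprints."""
    return bin(a ^ b).count("1")

def is_candidate(fp1: int, fp2: int, bit_diff: int = 3) -> bool:
    """True if the fingerprints differ in at most bit_diff bits."""
    return hamming(fp1, fp2) <= bit_diff
```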