Principle:ChenghaoMou Text dedup Jaccard Similarity
| Knowledge Sources | |
|---|---|
| Domains | Similarity_Metrics, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A set similarity metric that measures the overlap between two token sets as the ratio of the size of their intersection to the size of their union.
Description
The Jaccard similarity coefficient (also called the Jaccard index) is a fundamental similarity metric for comparing two sets. For text deduplication, it measures how similar two documents are by comparing their sets of n-gram tokens. A Jaccard similarity of 1.0 means the token sets are identical; 0.0 means they share no tokens.
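As a minimal sketch of the comparison described above, the following Python computes the exact Jaccard similarity of two short documents over word trigrams. The helper names `ngrams` and `jaccard`, the whitespace tokenization, and the n-gram size of 3 are assumptions chosen for illustration; they are not text-dedup's own API or defaults.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in `text` (hypothetical helper).

    Tokenization is plain lowercased whitespace splitting, an assumption
    made for this sketch.
    """
    tokens = text.lower().split()
    if len(tokens) < n:
        # Too short for a full n-gram: keep the whole sequence as one item.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention for two empty sets, as stated on this page
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

# Each document yields 7 trigrams; 4 are shared, so the union has 10.
print(jaccard(ngrams(doc1), ngrams(doc2)))  # 0.4
```

Changing a single word alters every n-gram that contains it, which is why one substitution in a nine-word sentence already drops the score to 0.4 at n = 3.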
In text-dedup, Jaccard similarity serves two roles: (1) it is the target metric that MinHash approximates probabilistically, and (2) it is the exact computation used in false positive verification to confirm that candidate pairs identified by MinHash or SimHash are truly similar.
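To illustrate the first role, here is a toy MinHash estimator: the fraction of positions at which two signatures agree is an estimate of the Jaccard similarity of the underlying sets. This is a sketch under stated assumptions, not text-dedup's implementation. The seeded SHA-1 hashing, the function names, and the choice of 256 hash functions are all hypothetical.

```python
import hashlib


def _hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of `item` under `seed` (illustrative only)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(items: set, num_perm: int = 256) -> list:
    """One minimum per seeded hash function; `items` must be non-empty."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions where the two minima coincide."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = {str(i) for i in range(0, 100)}
b = {str(i) for i in range(50, 150)}

exact = len(a & b) / len(a | b)  # 50 / 150
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
approx = estimate_jaccard(sig_a, sig_b)

print(f"exact={exact:.3f} estimate={approx:.3f}")
```

The estimate is noisy: with k hash functions its standard error is roughly sqrt(J(1 - J) / k). That noise is the reason an exact Jaccard check on candidate pairs is useful.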
Usage
Use this metric whenever the exact set similarity between two documents needs to be computed, particularly during the false-positive verification steps of MinHash and SimHash pipelines.
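A verification step of this kind can be sketched as a filter over candidate pairs. The function `verify_candidates`, the `token_sets` mapping, and the 0.7 threshold below are hypothetical names and values chosen for illustration; text-dedup's actual verification code may be structured differently.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, with J(∅, ∅) = 1 by convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def verify_candidates(token_sets: dict, candidates: list, threshold: float = 0.7) -> list:
    """Keep only candidate pairs whose exact similarity meets the threshold."""
    return [
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


# Token sets keyed by document id (toy data).
token_sets = {
    0: {"a", "b", "c", "d"},
    1: {"a", "b", "c", "e"},  # J with doc 0 = 3/5 = 0.6
    2: {"a", "b", "c", "d"},  # J with doc 0 = 4/4 = 1.0
}

# Pairs proposed by an approximate method such as MinHash or SimHash.
candidates = [(0, 1), (0, 2)]

confirmed = verify_candidates(token_sets, candidates, threshold=0.7)
print(confirmed)  # [(0, 2)]
```

Only pairs that the approximate stage has already flagged are checked, so the exact computation runs on a small fraction of all possible document pairs.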
Theoretical Basis
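Stated formally, the ratio described in the Overview can be written as follows for two sets A and B that are not both empty. The second line is a small worked instance, using sets chosen here purely for illustration.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Worked instance with A = \{a, b, c, d\} and B = \{a, b, c, e\}:
J(A, B) = \frac{|\{a, b, c\}|}{|\{a, b, c, d, e\}|} = \frac{3}{5} = 0.6
```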
Properties:
- J(A, A) = 1 (identity)
- J(A, B) = J(B, A) (symmetry)
- J(A, ∅) = 0 if A is non-empty; for two empty sets the ratio is 0/0 and therefore undefined, so text-dedup sets J(∅, ∅) = 1 by convention
- 0 ≤ J(A, B) ≤ 1 (bounded)