Implementation:ChenghaoMou Text dedup Jaccard Similarity Func
| Knowledge Sources | |
|---|---|
| Domains | Similarity_Metrics, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by text-dedup, for computing the Jaccard similarity between two sets of tokens.
Description
The jaccard_similarity function computes the Jaccard coefficient, the size of the intersection divided by the size of the union, between two sets of string or byte tokens using Python set operations. It handles the edge case of two empty sets by returning 1.0. The function is used for false positive verification in both the MinHash and SimHash pipelines.
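The behavior described above can be approximated with a few lines of plain Python. This is a minimal sketch written from the description, not a copy of the text-dedup source; the name `jaccard_sketch` is hypothetical.

```python
def jaccard_sketch(a: set, b: set) -> float:
    """Illustrative stand-in for jaccard_similarity (not the library source)."""
    # Documented edge case: two empty sets are treated as identical.
    if not a and not b:
        return 1.0
    # Intersection size divided by union size.
    return len(a & b) / len(a | b)


print(jaccard_sketch({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared / 4 total = 0.5
print(jaccard_sketch(set(), set()))                      # 1.0
print(jaccard_sketch({"a"}, set()))                      # 0.0
```

When only one of the two sets is empty, the union is non-empty and the intersection is empty, so the result is 0.0 without any special handling.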
A companion function cluster_jaccard_similarity computes pairwise similarities within a cluster for reporting purposes.
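One plausible way to produce the documented return value (per-document maximum similarity plus a false positive rate) is sketched below. The rule used here, that a document counts as a false positive when its best match within the cluster falls below the threshold, is an assumption made for illustration and may differ from the library's exact logic. The names `jaccard_sketch` and `cluster_similarity_sketch` are hypothetical.

```python
def jaccard_sketch(a: set, b: set) -> float:
    """Illustrative stand-in for jaccard_similarity (not the library source)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_similarity_sketch(
    cluster: list[set[bytes]],
    threshold: float,
) -> tuple[list[float], float]:
    """Return each document's best similarity to any other cluster member,
    plus the fraction of documents whose best similarity is below threshold.
    """
    if not cluster:
        return [], 0.0
    max_sims: list[float] = []
    false_positives = 0
    for i, doc in enumerate(cluster):
        # Compare against every other member; a singleton cluster yields 0.0.
        best = max(
            (jaccard_sketch(doc, other) for j, other in enumerate(cluster) if j != i),
            default=0.0,
        )
        max_sims.append(best)
        # Assumed rule: no sufficiently similar neighbor means a false positive.
        if best < threshold:
            false_positives += 1
    return max_sims, false_positives / len(cluster)


cluster = [{b"a", b"b", b"c"}, {b"a", b"b", b"d"}, {b"x", b"y", b"z"}]
sims, fp_rate = cluster_similarity_sketch(cluster, threshold=0.4)
print(sims)     # [0.5, 0.5, 0.0]
print(fp_rate)  # one of three documents falls below the threshold
```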
Usage
Import this function when computing exact Jaccard similarity between two tokenized documents during false positive verification.
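As a rough illustration of false positive verification, the sketch below tokenizes two candidate documents into word bigrams and accepts the pair only if their exact Jaccard similarity reaches a threshold. The tokenizer, the threshold value, and the names `word_ngrams`, `jaccard_sketch`, and `is_true_duplicate` are all assumptions for this example; in a real pipeline the library's `jaccard_similarity` and the pipeline's own tokenization would take their place.

```python
def jaccard_sketch(a: set, b: set) -> float:
    """Illustrative stand-in for jaccard_similarity (not the library source)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def word_ngrams(text: str, n: int = 2) -> set[str]:
    """Hypothetical tokenizer: lowercase word n-grams joined by spaces."""
    words = text.lower().split()
    if len(words) < n:
        # Too short for a full n-gram: fall back to the whole text, if any.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_true_duplicate(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Confirm a candidate pair by exact Jaccard similarity on its n-grams."""
    return jaccard_sketch(word_ngrams(text_a), word_ngrams(text_b)) >= threshold


# 3 shared bigrams out of 5 distinct bigrams gives 0.6, above the threshold.
print(is_true_duplicate("the quick brown fox jumps", "the quick brown fox leaps"))  # True
# No shared bigrams, so the candidate pair is rejected as a false positive.
print(is_true_duplicate("the quick brown fox jumps", "an entirely different sentence here"))  # False
```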
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/utils/jaccard.py
- Lines: L9-47
Signature
def jaccard_similarity(
    doc1: set[str] | set[bytes],
    doc2: set[str] | set[bytes],
) -> float:
    """Compute the Jaccard similarity between two sets of tokens.

    Parameters
    ----------
    doc1 : set[str] | set[bytes]
        The first set of tokens.
    doc2 : set[str] | set[bytes]
        The second set of tokens.

    Returns
    -------
    float
        The Jaccard similarity (0.0 to 1.0).
    """

def cluster_jaccard_similarity(
    cluster: list[set[bytes]],
    threshold: float,
) -> tuple[list[float], float]:
    """Compute pairwise Jaccard similarities within a cluster.

    Returns
    -------
    tuple[list[float], float]
        Per-document max similarity and false positive rate.
    """
Import
from text_dedup.utils.jaccard import jaccard_similarity, cluster_jaccard_similarity
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc1 | set[str] or set[bytes] | Yes | First set of tokens |
| doc2 | set[str] or set[bytes] | Yes | Second set of tokens |
Outputs
| Name | Type | Description |
|---|---|---|
| similarity | float | Jaccard similarity coefficient (0.0 to 1.0) |
Usage Examples
Computing Jaccard Similarity
from text_dedup.utils.jaccard import jaccard_similarity
doc1_tokens = {b"hello world", b"foo bar", b"test data"}
doc2_tokens = {b"hello world", b"foo bar", b"other data"}
sim = jaccard_similarity(doc1_tokens, doc2_tokens)
print(f"Similarity: {sim:.4f}") # 0.5000 (2 shared out of 4 total)