Implementation:ChenghaoMou Text dedup Jaccard Similarity Func

Knowledge Sources	text-dedup
Domains	Similarity_Metrics, Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by text-dedup, for computing the Jaccard similarity between two sets of tokens.

Description

The jaccard_similarity function computes the Jaccard coefficient between two sets of string or byte tokens using Python set operations. It handles the edge case of two empty sets by returning 1.0. The function is used in false positive verification for both MinHash and SimHash pipelines.
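The behavior described above follows the standard definition J(A, B) = |A ∩ B| / |A ∪ B|. The following is a minimal sketch of that definition, including the empty-set convention. The name jaccard_sketch is hypothetical and the body is an illustrative re-implementation, not the library source.

```python
def jaccard_sketch(a: set, b: set) -> float:
    """Illustrative Jaccard coefficient; NOT the text-dedup source code."""
    # Two empty sets are treated as identical, matching the described edge case.
    if not a and not b:
        return 1.0
    # Size of the intersection divided by size of the union.
    return len(a & b) / len(a | b)


print(jaccard_sketch({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5 (2 shared, 4 distinct)
print(jaccard_sketch(set(), set()))                      # 1.0 (both empty)
print(jaccard_sketch({"a"}, set()))                      # 0.0 (nothing shared)
```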

A companion function, cluster_jaccard_similarity, computes pairwise similarities within a cluster and returns each document's maximum similarity together with a false positive rate, for reporting purposes.
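One way such a cluster report could be computed is sketched below. It rests on an assumption that the source does not confirm: a document counts as a false positive when its best match among the other cluster members falls below the threshold. The names cluster_similarity_sketch and _jaccard are hypothetical.

```python
def _jaccard(a: set, b: set) -> float:
    # Hypothetical helper: standard Jaccard, with two empty sets scoring 1.0.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_similarity_sketch(
    cluster: list[set[bytes]], threshold: float
) -> tuple[list[float], float]:
    """Illustrative only; the library's actual logic may differ."""
    max_sims: list[float] = []
    for i, doc in enumerate(cluster):
        # Compare each document against every other member of the cluster.
        sims = [_jaccard(doc, other) for j, other in enumerate(cluster) if j != i]
        max_sims.append(max(sims, default=0.0))
    # ASSUMPTION: a false positive is a member whose best match is below threshold.
    false_positives = sum(1 for s in max_sims if s < threshold)
    rate = false_positives / len(cluster) if cluster else 0.0
    return max_sims, rate


cluster = [{b"a", b"b", b"c"}, {b"a", b"b", b"d"}, {b"x", b"y", b"z"}]
max_sims, fp_rate = cluster_similarity_sketch(cluster, threshold=0.5)
print(max_sims)  # [0.5, 0.5, 0.0]
print(fp_rate)   # one of three members falls below the threshold
```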

Usage

Import this function when computing exact Jaccard similarity between two tokenized documents during false positive verification.
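As a rough illustration of false positive verification, the sketch below tokenizes two candidate documents into word n-gram sets and accepts the pair only if their exact Jaccard similarity reaches a threshold. The helpers word_ngrams and is_true_duplicate, the bigram tokenization, and the threshold values are all assumptions for illustration. In a real pipeline the inline similarity computation would be replaced by a call to jaccard_similarity.

```python
def word_ngrams(text: str, n: int = 2) -> set[str]:
    # Hypothetical tokenizer: lowercase word n-grams (shingles).
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_true_duplicate(text1: str, text2: str, threshold: float, n: int = 2) -> bool:
    # Verify a candidate pair with exact Jaccard similarity on token sets.
    a, b = word_ngrams(text1, n), word_ngrams(text2, n)
    # Inline stand-in for jaccard_similarity so the sketch runs without the library.
    sim = 1.0 if not a and not b else len(a & b) / len(a | b)
    return sim >= threshold


t1 = "the quick brown fox jumps"
t2 = "the quick brown fox leaps"
# The pair shares 3 of 5 distinct bigrams, so its similarity is 0.6.
print(is_true_duplicate(t1, t2, threshold=0.5))  # True
print(is_true_duplicate(t1, t2, threshold=0.7))  # False
```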

Code Reference

Source Location

Repository: text-dedup
File: src/text_dedup/utils/jaccard.py
Lines: L9-47

Signature

def jaccard_similarity(
    doc1: set[str] | set[bytes],
    doc2: set[str] | set[bytes],
) -> float:
    """Compute the Jaccard similarity between two sets of tokens.

    Parameters
    ----------
    doc1 : set[str] | set[bytes]
        The first set of tokens.
    doc2 : set[str] | set[bytes]
        The second set of tokens.

    Returns
    -------
    float
        The Jaccard similarity (0.0 to 1.0).
    """

def cluster_jaccard_similarity(
    cluster: list[set[bytes]],
    threshold: float,
) -> tuple[list[float], float]:
    """Compute pairwise Jaccard similarities within a cluster.

    Returns
    -------
    tuple[list[float], float]
        Per-document max similarity and false positive rate.
    """

Import

from text_dedup.utils.jaccard import jaccard_similarity, cluster_jaccard_similarity

I/O Contract

Inputs

Name	Type	Required	Description
doc1	set[str] or set[bytes]	Yes	First set of tokens
doc2	set[str] or set[bytes]	Yes	Second set of tokens

Outputs

Name	Type	Description
similarity	float	Jaccard similarity coefficient (0.0 to 1.0)

Usage Examples

Computing Jaccard Similarity

from text_dedup.utils.jaccard import jaccard_similarity

doc1_tokens = {b"hello world", b"foo bar", b"test data"}
doc2_tokens = {b"hello world", b"foo bar", b"other data"}

sim = jaccard_similarity(doc1_tokens, doc2_tokens)
print(f"Similarity: {sim:.4f}")  # 0.5000 (2 shared out of 4 total)

Related Pages

Implements Principle

Principle:ChenghaoMou_Text_dedup_Jaccard_Similarity

Requires Environment

Environment:ChenghaoMou_Text_dedup_Python_3_12_Environment
