Implementation:ChenghaoMou Text dedup Jaccard Similarity Func

From Leeroopedia
Knowledge Sources
Domains Similarity_Metrics, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by text-dedup, for computing the Jaccard similarity between two sets of tokens.

Description

The jaccard_similarity function computes the Jaccard coefficient, the size of the intersection divided by the size of the union, between two sets of string or byte tokens using Python set operations. When both sets are empty the union has size zero, so the function returns 1.0 for that edge case instead of dividing by zero. The function is used for false positive verification in both the MinHash and SimHash pipelines.
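The behavior described above can be illustrated with a minimal sketch. The name jaccard_sketch is hypothetical and this is not the library's source; it only mirrors the documented contract (set operations, 1.0 for two empty sets).

```python
def jaccard_sketch(doc1: set, doc2: set) -> float:
    """Illustrative stand-in for the documented behavior, not the library code."""
    # Two empty sets have an empty union; treat them as identical.
    if not doc1 and not doc2:
        return 1.0
    # |A intersect B| / |A union B|
    return len(doc1 & doc2) / len(doc1 | doc2)


both_empty = jaccard_sketch(set(), set())
one_empty = jaccard_sketch({"a"}, set())
partial = jaccard_sketch({"a", "b", "c"}, {"a", "b", "d"})
print(both_empty, one_empty, partial)
```

Note that only the case where both sets are empty needs special handling: if exactly one set is empty, the union is non-empty and the ratio is simply 0.0.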

A companion function cluster_jaccard_similarity computes pairwise similarities within a cluster for reporting purposes.
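One way such a cluster report could be computed is sketched below. This is an assumption-laden illustration, not the library's implementation: the name cluster_report_sketch is hypothetical, and the false positive rate is assumed here to mean the fraction of documents whose best match within the cluster falls below the threshold. The empty-cluster guard is also an assumption.

```python
def _jaccard(a: set, b: set) -> float:
    # Local stand-in for jaccard_similarity so the sketch runs on its own.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_report_sketch(
    cluster: list[set[bytes]], threshold: float
) -> tuple[list[float], float]:
    """Hypothetical sketch: per-document max similarity and false positive rate."""
    if not cluster:  # assumption: an empty cluster yields an empty report
        return [], 0.0
    max_sims: list[float] = []
    false_positives = 0
    for i, doc in enumerate(cluster):
        # Best similarity between this document and any other cluster member.
        best = max(
            (_jaccard(doc, other) for j, other in enumerate(cluster) if j != i),
            default=0.0,
        )
        max_sims.append(best)
        # Assumed definition: a member is a false positive if nothing in the
        # cluster is at least `threshold` similar to it.
        if best < threshold:
            false_positives += 1
    return max_sims, false_positives / len(cluster)


sims, fp_rate = cluster_report_sketch(
    [{b"a", b"b"}, {b"a", b"b"}, {b"x", b"y"}], threshold=0.5
)
print(sims, fp_rate)
```

The pairwise loop is quadratic in the cluster size, which is usually acceptable for reporting on individual clusters but worth keeping in mind for very large ones.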

Usage

Import this function when computing exact Jaccard similarity between two tokenized documents during false positive verification.
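A verification step of this kind might look like the following sketch. The helpers word_ngrams and is_true_duplicate, the bigram tokenization, and the threshold values are all hypothetical choices made for illustration; a local stand-in replaces the imported function so the example runs without text-dedup installed.

```python
def _jaccard(a: set, b: set) -> float:
    # Stand-in for text_dedup.utils.jaccard.jaccard_similarity.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def word_ngrams(text: str, n: int = 2) -> set[str]:
    """Hypothetical tokenizer: lowercase word n-grams as a set."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}


def is_true_duplicate(text_a: str, text_b: str, threshold: float) -> bool:
    """Confirm a candidate pair by exact Jaccard similarity on its tokens."""
    return _jaccard(word_ngrams(text_a), word_ngrams(text_b)) >= threshold


a = "the quick brown fox jumps"
b = "the quick brown fox leaps"
score = _jaccard(word_ngrams(a), word_ngrams(b))  # 3 shared bigrams, 5 in the union
print(score, is_true_duplicate(a, b, 0.5), is_true_duplicate(a, b, 0.7))
```

A candidate pair flagged by an approximate method is kept only if its exact similarity meets the threshold; otherwise it is treated as a false positive.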

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/utils/jaccard.py
  • Lines: L9-47

Signature

def jaccard_similarity(
    doc1: set[str] | set[bytes],
    doc2: set[str] | set[bytes],
) -> float:
    """Compute the Jaccard similarity between two sets of tokens.

    Parameters
    ----------
    doc1 : set[str] | set[bytes]
        The first set of tokens.
    doc2 : set[str] | set[bytes]
        The second set of tokens.

    Returns
    -------
    float
        The Jaccard similarity (0.0 to 1.0).
    """

def cluster_jaccard_similarity(
    cluster: list[set[bytes]],
    threshold: float,
) -> tuple[list[float], float]:
    """Compute pairwise Jaccard similarities within a cluster.

    Returns
    -------
    tuple[list[float], float]
        Per-document max similarity and false positive rate.
    """

Import

from text_dedup.utils.jaccard import jaccard_similarity, cluster_jaccard_similarity

I/O Contract

Inputs

Name | Type | Required | Description
doc1 | set[str] or set[bytes] | Yes | First set of tokens
doc2 | set[str] or set[bytes] | Yes | Second set of tokens

Outputs

Name | Type | Description
similarity | float | Jaccard similarity coefficient (0.0 to 1.0)

Usage Examples

Computing Jaccard Similarity

from text_dedup.utils.jaccard import jaccard_similarity

doc1_tokens = {b"hello world", b"foo bar", b"test data"}
doc2_tokens = {b"hello world", b"foo bar", b"other data"}

sim = jaccard_similarity(doc1_tokens, doc2_tokens)
print(f"Similarity: {sim:.4f}")  # 0.5000 (2 shared out of 4 total)

