
Principle:ChenghaoMou Text dedup Jaccard Similarity

From Leeroopedia
Knowledge Sources
Domains: Similarity_Metrics, Deduplication
Last Updated: 2026-02-14 21:00 GMT

Overview

A set similarity metric that measures the overlap between two token sets as the size of their intersection divided by the size of their union.

Description

The Jaccard similarity coefficient (also called Jaccard index) is a fundamental similarity metric for comparing two sets. For text deduplication, it measures how similar two documents are by comparing their sets of n-gram tokens. A Jaccard similarity of 1.0 means identical token sets; 0.0 means no overlap.

In text-dedup, Jaccard similarity serves two roles: (1) it is the target metric that MinHash approximates probabilistically, and (2) it is the exact computation used in false positive verification to confirm that candidate pairs identified by MinHash or SimHash are truly similar.
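The first role can be illustrated with a minimal sketch that compares a MinHash estimate against the exact Jaccard value. This is not text-dedup's implementation: the helper names (`exact_jaccard`, `seeded_hash`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and salted BLAKE2b hashes stand in for whatever hash family the library actually uses.

```python
import hashlib


def exact_jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; the seed selects one of many hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens: set[str], num_hashes: int = 256) -> list[int]:
    """For each hash function, keep the smallest hash value over the set.

    Assumes `tokens` is non-empty (min() of an empty sequence raises).
    """
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions where the two minima agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Illustrative data: 50 shared tokens out of 150 distinct tokens.
a = {str(i) for i in range(0, 100)}
b = {str(i) for i in range(50, 150)}
exact = exact_jaccard(a, b)
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The estimate should fall close to the exact value of 1/3, with an error that shrinks as `num_hashes` grows. The remaining gap is why candidate pairs are re-checked with the exact computation.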

Usage

Use this metric whenever exact set similarity between two documents needs to be computed, particularly during false positive verification steps in MinHash and SimHash pipelines.
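A verification step of this kind might look like the following sketch. The function names, the word-bigram tokenizer, and the 0.7 threshold are assumptions chosen for illustration, not text-dedup's API or defaults.

```python
def word_ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def verify_candidates(
    docs: list[str],
    candidate_pairs: list[tuple[int, int]],
    threshold: float = 0.7,  # hypothetical threshold for this example
    n: int = 2,
) -> list[tuple[int, int]]:
    """Keep only the candidate pairs whose exact Jaccard meets the threshold."""
    token_sets = [word_ngrams(doc, n) for doc in docs]
    return [
        (i, j)
        for i, j in candidate_pairs
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about set similarity",
]
# Pretend an approximate method flagged both pairs as candidates.
candidates = [(0, 1), (0, 2)]
verified = verify_candidates(docs, candidates)
print(verified)  # [(0, 1)]
```

Documents 0 and 1 share 7 of 9 distinct bigrams (about 0.78), so the pair survives. Documents 0 and 2 share none, so that pair is discarded as a false positive.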

Theoretical Basis

J(A, B) = |A ∩ B| / |A ∪ B|

Properties:

  • J(A, A) = 1 (identity)
  • J(A, B) = J(B, A) (symmetry)
  • J(A, ∅) = 0 if A is non-empty; J(∅, ∅) is 0/0 under the formula, so text-dedup defines it as 1 by convention
  • 0 ≤ J(A, B) ≤ 1 (bounded)
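
The formula and the properties above can be checked directly with a small sketch. The `jaccard` function here is a hypothetical helper written for this example. It follows the empty-set convention stated in the list and is not necessarily how text-dedup implements the metric.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, with J(∅, ∅) defined as 1 by convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


a = {"the", "quick", "brown", "fox"}
b = {"the", "lazy", "brown", "dog"}
empty = set()

assert jaccard(a, a) == 1.0            # identity
assert jaccard(a, b) == jaccard(b, a)  # symmetry
assert jaccard(a, empty) == 0.0        # non-empty set vs. empty set
assert jaccard(empty, empty) == 1.0    # empty-set convention
assert 0.0 <= jaccard(a, b) <= 1.0     # bounded

similarity = jaccard(a, b)  # 2 shared tokens out of 6 distinct tokens
print(similarity)
```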

Related Pages

Implemented By
