Principle:ChenghaoMou Text dedup Jaccard Similarity

From Leeroopedia
Knowledge Sources
Domains Similarity_Metrics, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A set similarity metric that measures the overlap between two token sets as the ratio of their intersection to their union.

Description

The Jaccard similarity coefficient (also called Jaccard index) is a fundamental similarity metric for comparing two sets. For text deduplication, it measures how similar two documents are by comparing their sets of n-gram tokens. A Jaccard similarity of 1.0 means identical token sets; 0.0 means no overlap.

In text-dedup, Jaccard similarity serves two roles: (1) it is the target metric that MinHash approximates probabilistically, and (2) it is the exact computation used in false positive verification to confirm that candidate pairs identified by MinHash or SimHash are truly similar.
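The first role can be illustrated with a toy sketch. It assumes the textbook MinHash construction, in which the fraction of matching per-seed minimum hashes estimates the Jaccard similarity. The helper names (`_hash`, `minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the seeded SHA-1 hash family, and the choice of 256 hash functions are all illustrative assumptions, not text-dedup's actual implementation.

```python
import hashlib
from typing import List, Set


def _hash(token: str, seed: int) -> int:
    """Hypothetical seeded hash: one 64-bit value per (seed, token) pair."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: Set[str], num_perm: int = 256) -> List[int]:
    """Keep the minimum hash under each seed (requires a non-empty set)."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact |A ∩ B| / |A ∪ B|, with J(∅, ∅) = 1 by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two synthetic token sets sharing 50 of 150 distinct tokens.
a = {f"tok{i}" for i in range(100)}
b = {f"tok{i}" for i in range(50, 150)}

exact = exact_jaccard(a, b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} estimate={approx:.3f}")
```

The estimate is noisy: with 256 hash functions its standard error here is roughly 0.03, which is why a pipeline may want an exact check on the candidate pairs it surfaces.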

Usage

Use this metric whenever exact set similarity between two documents needs to be computed, particularly during false positive verification steps in MinHash and SimHash pipelines.
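A verification step of this kind might look like the following minimal sketch. The function names (`ngram_set`, `jaccard`, `verify_candidates`), the word-bigram tokenization, and the 0.7 threshold are assumptions chosen for illustration; they are not taken from text-dedup.

```python
from typing import List, Set, Tuple


def ngram_set(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Set of word n-grams from whitespace-split text (assumed tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Exact Jaccard similarity, with J(∅, ∅) = 1 by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def verify_candidates(
    docs: List[str],
    candidates: List[Tuple[int, int]],
    threshold: float = 0.7,
    n: int = 2,
) -> List[Tuple[int, int, float]]:
    """Keep only the candidate pairs whose exact Jaccard meets the threshold."""
    token_sets = [ngram_set(doc, n) for doc in docs]
    verified = []
    for i, j in candidates:
        score = jaccard(token_sets[i], token_sets[j])
        if score >= threshold:
            verified.append((i, j, score))
    return verified


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about set similarity",
]
# Pretend an approximate method proposed these pairs; (0, 2) is a false positive.
candidates = [(0, 1), (0, 2)]
verified = verify_candidates(docs, candidates, threshold=0.7)
print(verified)
```

Here the pair (0, 1) shares 7 of 9 distinct bigrams and is kept, while (0, 2) shares none and is dropped.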

Theoretical Basis

J(A, B) = |A ∩ B| / |A ∪ B|

Properties:

  • J(A, A) = 1 (identity)
  • J(A, B) = J(B, A) (symmetry)
  • J(A, ∅) = 0 if A is non-empty (by convention, J(∅, ∅) = 1 in text-dedup)
  • 0 ≤ J(A, B) ≤ 1 (bounded)
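
The coefficient and the properties listed above can be checked with a short sketch. The names `word_ngrams` and `jaccard` and the word-bigram tokenization are hypothetical choices for illustration, not the library's API; the empty-set behaviour follows the convention stated above.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Set of word n-grams from whitespace-split text (assumed tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """|A ∩ B| / |A ∪ B|, returning 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


a = word_ngrams("the quick brown fox")
b = word_ngrams("the quick brown dog")

print(jaccard(a, b))          # 2 shared bigrams out of 4 distinct: 0.5
print(jaccard(a, a))          # identity: 1.0
print(jaccard(a, set()))      # non-empty vs. empty: 0.0
print(jaccard(set(), set()))  # both empty, by convention: 1.0
```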

Related Pages

Implemented By
