Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:ChenghaoMou Text dedup Bloom Filter Deduplication

From Leeroopedia
Knowledge Sources
Domains Data_Structures, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A space-efficient probabilistic data structure that performs streaming exact-match deduplication by testing document membership in a set with a controlled false positive rate.

Description

Bloom Filter Deduplication uses a probabilistic set data structure to detect exact duplicate documents in a single pass. For each document, the filter checks if the full text has been seen before (membership test). If yes, the document is marked as a duplicate. If no, the text is added to the filter. The Bloom filter provides: (1) constant-time O(1) membership testing and insertion, (2) guaranteed zero false negatives (a seen document will always be detected), (3) controlled false positive rate determined by filter capacity and error rate parameters.

The key constraint is that Bloom filters are stateful and not parallelizable — documents must be processed sequentially (num_proc=1) to maintain correctness.

Usage

Use this principle when exact-match deduplication is needed and memory efficiency is more important than supporting near-duplicate detection. Ideal for large corpora where exact copies need to be removed in a single streaming pass.

Theoretical Basis

A Bloom filter uses k hash functions and an m-bit array:

# Abstract Bloom filter operation (NOT real implementation)
for each document:
    if text in bloom_filter:  # All k hash positions are set
        mark_duplicate(document)
    else:
        bloom_filter.add(text)  # Set k hash positions

The false positive rate for n inserted elements: p(1ekn/m)k

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment