Implementation:ChenghaoMou Text dedup Bloom Filter Func

From Leeroopedia
Knowledge Sources
Domains Data_Structures, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by text-dedup, for streaming exact-match deduplication built on the rbloom Bloom filter library.

Description

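The bloom_filter function in bloom_filter.py creates an rbloom.Bloom filter via BloomFilterAlgorithmConfig.get_filter() and processes the dataset sequentially (num_proc=1, because the filter is stateful and each document must be checked against all documents seen before it). For each document's text, it tests membership (text in bf) and adds the text to the filter if it has not been seen. The result is a boolean duplicate column on the dataset. Because a Bloom filter can report false positives but never false negatives, every true repeat is flagged, while a small fraction of unique documents may be flagged as well.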
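The check-then-add loop described above can be sketched in plain Python. This is a minimal illustration, not the text-dedup source: TinyBloom is a hypothetical stand-in for rbloom.Bloom (it only mimics the `in` test and `add`), and flag_duplicates is an assumed name for the per-document logic.

```python
import hashlib


class TinyBloom:
    """Hypothetical stand-in for a Bloom filter; this is NOT rbloom."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive one bit position per hash from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def flag_duplicates(texts, bf):
    """Return one flag per text: True if the filter reports it as already seen."""
    flags = []
    for text in texts:
        seen = text in bf      # membership test comes first
        if not seen:
            bf.add(text)       # remember first occurrences only
        flags.append(seen)
    return flags


flags = flag_duplicates(["a cat", "a dog", "a cat"], TinyBloom())
print(flags)
```

The loop must run in order over a single shared filter, which is why the real function is described as using num_proc=1: splitting the data across workers would give each worker its own filter and miss duplicates that land in different shards.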

The BloomFilterAlgorithmConfig class configures the filter with max_elements (expected capacity) and error_rate (target false positive probability), which are passed directly to rbloom.Bloom.
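To get a feel for how these two settings translate into memory, the textbook Bloom filter sizing formulas can be evaluated directly. The sketch below uses the standard optimal-size equations; bloom_size is a hypothetical helper, and rbloom's internal sizing may round or allocate differently.

```python
import math


def bloom_size(max_elements: int, error_rate: float) -> tuple[int, int]:
    """Textbook optimal (number of bits, number of hash functions)."""
    if max_elements <= 0 or not 0.0 < error_rate < 1.0:
        raise ValueError("max_elements must be > 0 and 0 < error_rate < 1")
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-max_elements * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    hashes = max(1, round(bits / max_elements * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(1_000_000, 0.01)
print(f"{bits / 8 / 1024 / 1024:.2f} MiB, {hashes} hash functions")
```

Under these formulas, one million expected documents at a 1% error rate need roughly 9.6 bits per document. Lowering error_rate reduces the chance of flagging unique documents at the cost of a larger filter, and setting max_elements below the real dataset size pushes the actual false positive rate above the configured target.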

Usage

Import this function when running text-dedup's Bloom filter pipeline to flag exact-duplicate documents in a single sequential pass over the dataset.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/bloom_filter.py (L23-47), src/text_dedup/config/algorithms/bloom.py (L10-16)

Signature

def bloom_filter(config: Config, ds: Dataset) -> Dataset:
    """Apply Bloom filter for exact-match deduplication.

    Parameters
    ----------
    config : Config
        Pipeline configuration with BloomFilter settings.
    ds : Dataset
        Input dataset with text column.

    Returns
    -------
    Dataset
        Dataset with 'duplicate' boolean column added.
    """

class BloomFilterAlgorithmConfig(AlgorithmConfig):
    algo_name: Literal["bloomfilter"] = "bloomfilter"
    max_elements: int
    error_rate: float

    def get_filter(self) -> Bloom:
        """Create a new rbloom.Bloom filter."""
        return Bloom(self.max_elements, self.error_rate)

Import

from text_dedup.bloom_filter import bloom_filter
from text_dedup.config import BloomFilterAlgorithmConfig

I/O Contract

Inputs

Name   | Type    | Required | Description
-------|---------|----------|--------------------------------------------------
config | Config  | Yes      | Pipeline configuration with Bloom filter settings
ds     | Dataset | Yes      | Input dataset with text column

Outputs

Name    | Type    | Description
--------|---------|-----------------------------------------------
Dataset | Dataset | Dataset with added 'duplicate' boolean column

Usage Examples

Running Bloom Filter Deduplication

from pathlib import Path

from text_dedup.bloom_filter import bloom_filter, load_and_preprocess, remove_duplicates
from text_dedup.config.base import load_config_from_toml

config = load_config_from_toml(Path("configs/bloom_filter.toml"))
ds, original_len = load_and_preprocess(config)

# Index documents through Bloom filter
ds = bloom_filter(config, ds)
print(ds.column_names)  # [..., 'duplicate']

# Remove duplicates
final_data = remove_duplicates(config, ds)
print(f"Removed {original_len - len(final_data)} duplicates")
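The removal step amounts to keeping only rows whose duplicate flag is False. The sketch below shows that idea on plain Python dictionaries. It is an assumption about what remove_duplicates does rather than its actual code, and drop_flagged is a hypothetical name.

```python
def drop_flagged(rows):
    """Keep only rows that were not flagged as duplicates."""
    return [row for row in rows if not row["duplicate"]]


# Toy rows shaped like the output of the indexing step (illustrative data).
rows = [
    {"text": "a cat", "duplicate": False},
    {"text": "a dog", "duplicate": False},
    {"text": "a cat", "duplicate": True},
]

kept = drop_flagged(rows)
print(f"Removed {len(rows) - len(kept)} duplicates")
```

The first occurrence of each text is never flagged, so it always survives the filtering step.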
