Implementation:ChenghaoMou Text dedup Bloom Filter Func
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool from text-dedup for streaming exact-match deduplication, built on the rbloom Bloom filter library.
Description
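The bloom_filter function in bloom_filter.py creates an rbloom.Bloom filter via BloomFilterAlgorithmConfig.get_filter() and processes the dataset sequentially (num_proc=1, because the filter is stateful and each membership check depends on every earlier insertion). For each document text, it checks membership (text in bf) and adds the text to the filter if it has not been seen. The result is a boolean duplicate column on the dataset. Because a Bloom filter can report false positives but never false negatives, every true repeat is flagged, while a small fraction of unique documents (roughly error_rate, as long as the filter holds no more than max_elements items) may also be flagged.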
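The check-then-add loop described above can be sketched in a few lines. This is a minimal illustration, not the code from bloom_filter.py: the helper name flag_duplicates is hypothetical, and a built-in set stands in for the filter so the example runs without dependencies. Any object that supports `in` and `.add()`, such as an rbloom.Bloom instance, could be passed instead.

```python
from typing import Iterable, List


def flag_duplicates(texts: Iterable[str], seen) -> List[bool]:
    """Return one flag per text: True if the text was already in `seen`.

    `seen` is any object that supports `in` and `.add()`, for example a
    built-in set or an rbloom.Bloom instance (hypothetical helper, not
    part of text-dedup).
    """
    flags = []
    for text in texts:  # strictly sequential: each check depends on earlier adds
        is_duplicate = text in seen
        if not is_duplicate:
            seen.add(text)
        flags.append(is_duplicate)
    return flags


docs = ["a b c", "d e f", "a b c", "g h i", "d e f"]
flags = flag_duplicates(docs, set())  # set() stands in for the Bloom filter
unique_docs = [doc for doc, dup in zip(docs, flags) if not dup]
```

With a real Bloom filter in place of the set, the flags for true repeats stay the same, but an occasional unique text may also be flagged as a duplicate.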
The BloomFilterAlgorithmConfig class configures the filter with max_elements (expected capacity) and error_rate (target false positive probability), which are passed directly to rbloom.Bloom.
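To get a feel for how max_elements and error_rate trade off against memory, the textbook Bloom filter sizing formulas can be evaluated directly. The sketch below is an approximation under that assumption: the helper name bloom_size is hypothetical, and rbloom's internal sizing may differ in detail.

```python
import math
from typing import Tuple


def bloom_size(max_elements: int, error_rate: float) -> Tuple[int, int]:
    """Estimate (bits, hash_count) for a classic Bloom filter.

    Uses the standard formulas m = -n * ln(p) / (ln 2)^2 and
    k = (m / n) * ln 2. Hypothetical helper for illustration only.
    """
    if max_elements <= 0 or not 0.0 < error_rate < 1.0:
        raise ValueError("need max_elements > 0 and 0 < error_rate < 1")
    bits = math.ceil(-max_elements * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / max_elements * math.log(2)))
    return bits, hashes


# One million expected documents at a 1% target false positive rate.
bits, hashes = bloom_size(1_000_000, 0.01)
megabytes = bits / 8 / 1_000_000  # roughly 1.2 MB for this configuration
```

Tightening error_rate by a factor of ten adds about 4.8 bits per element, so memory grows only logarithmically as the target false positive rate shrinks.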
Usage
Import bloom_filter when you need single-pass, exact-match deduplication, typically as the indexing step of the Bloom filter pipeline.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/bloom_filter.py (L23-47), src/text_dedup/config/algorithms/bloom.py (L10-16)
Signature
def bloom_filter(config: Config, ds: Dataset) -> Dataset:
    """Apply Bloom filter for exact-match deduplication.

    Parameters
    ----------
    config : Config
        Pipeline configuration with BloomFilter settings.
    ds : Dataset
        Input dataset with text column.

    Returns
    -------
    Dataset
        Dataset with 'duplicate' boolean column added.
    """


class BloomFilterAlgorithmConfig(AlgorithmConfig):
    algo_name: Literal["bloomfilter"] = "bloomfilter"
    max_elements: int
    error_rate: float

    def get_filter(self) -> Bloom:
        """Create a new rbloom.Bloom filter."""
        return Bloom(self.max_elements, self.error_rate)
Import
from text_dedup.bloom_filter import bloom_filter
from text_dedup.config import BloomFilterAlgorithmConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Pipeline configuration with Bloom filter settings |
| ds | Dataset | Yes | Input dataset with text column |
Outputs
| Name | Type | Description |
|---|---|---|
| Dataset | Dataset | Dataset with added 'duplicate' boolean column |
Usage Examples
Running Bloom Filter Deduplication
from pathlib import Path

from text_dedup.bloom_filter import bloom_filter, load_and_preprocess, remove_duplicates
from text_dedup.config.base import load_config_from_toml

config = load_config_from_toml(Path("configs/bloom_filter.toml"))
ds, original_len = load_and_preprocess(config)

# Index documents through Bloom filter
ds = bloom_filter(config, ds)
print(ds.column_names)  # [..., 'duplicate']

# Remove duplicates
final_data = remove_duplicates(config, ds)
print(f"Removed {original_len - len(final_data)} duplicates")