Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Huggingface Datatrove Bloom Filter Deduplication

From Leeroopedia
Knowledge Sources
Domains Deduplication, Data Processing
Last Updated 2026-02-14 17:00 GMT

Overview

Bloom filter deduplication is a space-efficient probabilistic technique for identifying and removing near-duplicate documents in a single streaming pass over a dataset.

Description

Near-duplicate detection in large text corpora is essential for training high-quality language models, as duplicate content leads to memorization, reduced diversity, and wasted compute. Exact deduplication requires storing hashes of all seen documents, which grows linearly with dataset size. Bloom filter deduplication provides a constant-memory alternative by using a probabilistic data structure that can test set membership with configurable false positive rates but zero false negatives.

Rather than hashing entire documents (which only catches exact duplicates), the approach uses n-gram shingles: overlapping sequences of n words from normalized text. Two documents are considered near-duplicates if a sufficient fraction of their shingles have been seen before. This shingle-based approach detects documents that share most of their content even if they differ in headers, footers, or minor edits.

The datatrove implementation uses a bytearray-backed Bloom filter with k universal hash functions parametrized by random coefficients modulo a Mersenne prime. Each shingle is hashed to produce k bit positions; setting these bits records the shingle's presence, and querying checks whether all k bits are set.

Usage

Use Bloom filter deduplication when you need memory-efficient, single-pass deduplication for large datasets. It is ideal for scenarios where the dataset is too large for exact deduplication and a small false positive rate (incorrectly marking unique documents as duplicates) is acceptable.

Theoretical Basis

Bloom Filter: A Bloom filter is a bit array of m bits, initially all zero, with k independent hash functions mapping elements to bit positions. To add an element, set all k mapped bits to 1. To query, check whether all k bits are 1 -- if any bit is 0, the element is definitely not in the set; if all are 1, the element is probably in the set (with a calculable false positive probability).

False Positive Probability: After inserting n elements into a Bloom filter with m bits and k hash functions, the false positive probability is approximately:

FP = (1 - (1 - 1/m)^(kn))^k = (1 - e^(-kn/m))^k

Optimal Number of Hash Functions: Given m bits and n expected elements, the optimal k that minimizes the false positive rate is:

k_opt = (m/n) * ln(2)

Universal Hashing: The k hash functions are implemented as h_i(x) = ((a_i * x + b_i) mod p) mod m, where p is a Mersenne prime (2^61 - 1) and a_i, b_i are random coefficients. This family of functions provides uniform, independent-like hash distributions.

Shingle-Based Near-Duplicate Detection: A document's shingle set is the set of all n-gram subsequences of its normalized text. The Jaccard similarity between two documents' shingle sets estimates their textual overlap. The Bloom filter approximates this by tracking which shingles have been seen globally. If more than a threshold fraction (e.g., 80%) of a document's shingles are already in the filter, the document is likely a near-duplicate of one or more previously seen documents.

Streaming vs. Batch: Unlike MinHash-based deduplication (which requires multiple passes or distributed coordination), Bloom filter deduplication processes documents in a single streaming pass. The tradeoff is that it only identifies duplicates relative to previously seen documents -- the order of processing affects which copy is kept.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment