Implementation:ChenghaoMou Text dedup MinHash Get Embed Func
| Knowledge Sources | |
|---|---|
| Domains | Hashing, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by MinHashAlgorithmConfig, for generating MinHash band signatures from document text.
Description
The get_embed_func method on MinHashAlgorithmConfig returns a closure that: (1) tokenizes text into byte n-grams, (2) hashes each n-gram with xxh3 or sha1, (3) applies universal hash permutations via vectorized numpy operations, (hashvalues * a + b) % prime & max_hash, (4) takes column-wise minimums to produce the MinHash signature, and (5) splits the signature into bands, returning one row per band with __band_idx__, __band_val__ (bytes), and __INDEX__.
The fingerprint function in minhash.py calls Dataset.map with this embed function in batched mode (batch_size=1) to process each document.
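The five steps described above can be sketched in plain Python. This is a minimal, non-vectorized illustration rather than the library's implementation: the helper names (make_embed_func, byte_ngrams, sha1_hash32), the default parameters, the 2^61 - 1 modulus, the 32-bit mask, and the little-endian band packing are all assumptions chosen for the example, and only sha1 is shown.

```python
import hashlib
import random
import struct

# Assumed constants for this sketch; the library's actual values may differ.
PRIME = (1 << 61) - 1      # modulus for the universal hash family
MAX_HASH = (1 << 32) - 1   # mask that keeps the low 32 bits


def byte_ngrams(text: str, n: int) -> set[bytes]:
    """Step 1: split the UTF-8 bytes of `text` into overlapping n-grams."""
    data = text.encode("utf-8")
    if len(data) < n:
        return {data} if data else set()
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def sha1_hash32(data: bytes) -> int:
    """Step 2: hash one n-gram to a 32-bit integer using sha1."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def make_embed_func(num_perm=8, num_bands=4, ngram_size=5, seed=42):
    """Hypothetical stand-in for get_embed_func: returns a closure."""
    assert num_perm % num_bands == 0
    rows = num_perm // num_bands
    rng = random.Random(seed)
    # One (a, b) pair per permutation, captured by the closure.
    perms = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
             for _ in range(num_perm)]
    hash_ranges = [(i * rows, (i + 1) * rows) for i in range(num_bands)]

    def embed(text_col: list[str], idx_col: list[int]) -> dict[str, list]:
        # batch_size=1, so each column holds exactly one element.
        text, idx = text_col[0], idx_col[0]
        hashes = [sha1_hash32(g) for g in byte_ngrams(text, ngram_size)]
        signature = []
        for a, b in perms:
            # Step 3: universal hash permutation of every n-gram hash.
            permuted = [((h * a + b) % PRIME) & MAX_HASH for h in hashes]
            # Step 4: the minimum per permutation is one signature entry.
            signature.append(min(permuted, default=MAX_HASH))
        # Step 5: pack each band's slice of the signature into bytes.
        band_vals = [struct.pack(f"<{end - start}I", *signature[start:end])
                     for start, end in hash_ranges]
        return {
            "__band_idx__": list(range(num_bands)),
            "__band_val__": band_vals,
            "__INDEX__": [idx] * num_bands,
        }

    return embed
```

Calling make_embed_func()(["some text"], [0]) returns three lists of length num_bands, which is why a single input row expands into one output row per band when the function is mapped in batched mode.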
Usage
Use this when executing the MinHash fingerprinting step of the MinHash LSH deduplication pipeline, after configuration loading and data preprocessing.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/config/algorithms/minhash.py
- Lines: L200-238
Signature
class MinHashAlgorithmConfig(AlgorithmConfig):
    def get_embed_func(
        self,
    ) -> Callable[[list[str], list[int]], dict[str, list[int | bytes]]]:
        """Create a function that embeds a string into a list of
        (band_idx, band_val, index) tuples.

        Returns
        -------
        Callable
            Closure capturing permutations, hash_func, hash_ranges, ngrams_func.
        """
Import
from text_dedup.config import MinHashAlgorithmConfig
from text_dedup.minhash import fingerprint
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text_col | list[str] | Yes | Batch of text strings (batch_size=1, so single element) |
| idx_col | list[int] | Yes | Batch of document indices |
Outputs
| Name | Type | Description |
|---|---|---|
| __band_idx__ | list[int] | Band index for each band (0 to num_bands-1) |
| __band_val__ | list[bytes] | Band signature bytes for each band |
| __INDEX__ | list[int] | Document index repeated for each band |
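To make the output contract concrete, the sketch below flattens a few hand-written outputs into (band index, band value, document index) rows and groups rows that share the same band index and band value. The sample byte values and the helper names flatten and candidate_groups are hypothetical, and the grouping is only an illustration of how such rows are typically consumed downstream; it is not part of get_embed_func.

```python
from collections import defaultdict

# Hypothetical outputs for three documents with num_bands = 2.
outputs = [
    {"__band_idx__": [0, 1], "__band_val__": [b"\x01\x02", b"\x03\x04"],
     "__INDEX__": [0, 0]},
    {"__band_idx__": [0, 1], "__band_val__": [b"\x01\x02", b"\xff\xff"],
     "__INDEX__": [1, 1]},
    {"__band_idx__": [0, 1], "__band_val__": [b"\xaa\xbb", b"\xcc\xdd"],
     "__INDEX__": [2, 2]},
]


def flatten(outputs):
    """Turn column-oriented outputs into one row per (document, band)."""
    rows = []
    for out in outputs:
        rows.extend(zip(out["__band_idx__"], out["__band_val__"],
                        out["__INDEX__"]))
    return rows


def candidate_groups(rows):
    """Group document indices that collide on the same band and value."""
    buckets = defaultdict(list)
    for band_idx, band_val, doc_idx in rows:
        buckets[(band_idx, band_val)].append(doc_idx)
    # Only buckets holding more than one document are candidates.
    return {key: docs for key, docs in buckets.items() if len(docs) > 1}


rows = flatten(outputs)
groups = candidate_groups(rows)
```

With this sample data, documents 0 and 1 share the value of band 0 and land in the same bucket, while document 2 collides with nothing.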
Usage Examples
Fingerprinting a Dataset
from text_dedup.config.base import load_config_from_toml
from text_dedup.minhash import fingerprint, load_and_preprocess
from pathlib import Path
config = load_config_from_toml(Path("configs/minhash.toml"))
ds, original_len, filtered_len = load_and_preprocess(config)
# Generate MinHash band signatures
embedded = fingerprint(config, ds)
print(embedded.column_names) # ['__band_idx__', '__band_val__', '__INDEX__']