Implementation:ChenghaoMou Text dedup SimHash Get Embed Func
| Knowledge Sources | |
|---|---|
| Domains | Hashing, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by SimHashAlgorithmConfig, for generating SimHash fingerprints and bit-permutation bucket keys from document text.
Description
The get_embed_func method on SimHashAlgorithmConfig returns a closure that: (1) tokenizes text into byte n-grams, (2) hashes each n-gram with xxh3 into a bitarray, (3) calls compute() to sum the weighted bit vectors and threshold the result into the SimHash fingerprint, (4) applies each pre-computed bit Permutation to generate bucket keys (search_mask bytes + permuted prefix bytes), and (5) returns __key__ (list of (mask, prefix) tuples), __val__ (raw signature bytes), and __INDEX__ (document index), with one entry per permutation.
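Steps (1) to (3) can be illustrated with a minimal sketch. It is not the library's code: it assumes a 64-bit fingerprint, unit weights, and plain Python integers instead of bitarray, and it swaps xxh3 for the standard library's blake2b so the example has no dependencies. The names byte_ngrams, hash64, simhash, and hamming are hypothetical.

```python
import hashlib

F = 64  # assumed fingerprint width in bits


def byte_ngrams(text: str, n: int = 3) -> list[bytes]:
    """Split UTF-8 bytes into overlapping n-grams (one token if the text is short)."""
    data = text.encode("utf-8")
    if len(data) <= n:
        return [data]
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def hash64(token: bytes) -> int:
    """64-bit hash of a token (blake2b stands in for xxh3 here)."""
    return int.from_bytes(hashlib.blake2b(token, digest_size=8).digest(), "big")


def simhash(text: str, n: int = 3) -> int:
    """Sum +1/-1 votes per bit position, then keep the bits with a positive sum."""
    counts = [0] * F
    for token in byte_ngrams(text, n):
        h = hash64(token)
        for bit in range(F):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because each bit is decided by a majority vote over all n-grams, texts that share most of their n-grams tend to produce fingerprints that differ in only a few bits.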
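The Permutation class implements bit-block permutation following the simhash-py architecture: the fingerprint is split into b blocks, and the blocks are reordered so that b-k of them form a prefix used for bucketing. Two fingerprints within Hamming distance k differ in at most k blocks, so they share the same prefix under at least one permutation and land in the same bucket.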
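The bucketing idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the Permutation class itself: it keeps the selected blocks in place and masks them out rather than moving them to the front, it enumerates block choices with itertools.combinations, and it uses integers for masks. The names block_masks and bucket_keys are hypothetical.

```python
from itertools import combinations

F = 64  # assumed fingerprint width in bits


def block_masks(b: int, f: int = F) -> list[int]:
    """Masks for b contiguous blocks covering f bits (widths differ by at most 1)."""
    masks, start = [], 0
    for i in range(b):
        width = f // b + (1 if i < f % b else 0)
        masks.append(((1 << width) - 1) << start)
        start += width
    return masks


def bucket_keys(fp: int, b: int, k: int, f: int = F) -> list[tuple[int, int]]:
    """One (mask, masked bits) key per choice of b-k blocks that must match exactly."""
    keys = []
    for chosen in combinations(block_masks(b, f), b - k):
        mask = 0
        for m in chosen:
            mask |= m
        keys.append((mask, fp & mask))
    return keys


# Two fingerprints that differ in 3 bits, spread over 3 of the 4 blocks.
fp = 0x0123456789ABCDEF
near = fp ^ (1 << 5) ^ (1 << 40) ^ (1 << 63)
shared = set(bucket_keys(fp, b=4, k=3)) & set(bucket_keys(near, b=4, k=3))
```

Here `shared` is non-empty because the one block untouched by the three bit flips yields an identical key for both fingerprints, which is the pigeonhole argument behind the scheme.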
Usage
Use this when executing the SimHash fingerprinting step of the SimHash deduplication pipeline.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/config/algorithms/simhash.py
- Lines: L328-380 (get_embed_func), L222-253 (compute), L189-219 (_unsigned_hash), L70-161 (Permutation), L24-68 (Mask)
Signature
class SimHashAlgorithmConfig(AlgorithmConfig):
    def get_embed_func(
        self,
    ) -> Callable[[list[str], list[int]], dict[str, list[int | bytes]]]:
        """Create a function that computes SimHash fingerprints with
        bit-permutation bucket keys.

        Returns
        -------
        Callable
            Closure capturing tokenizer, hash_func, permutations.
        """

def compute(hashes: list[bitarray]) -> bitarray:
    """Compute the SimHash of a list of token hashes.

    Parameters
    ----------
    hashes : list[bitarray]
        List of per-token hash bitarrays.

    Returns
    -------
    bitarray
        The aggregated SimHash fingerprint.
    """
Import
from text_dedup.config import SimHashAlgorithmConfig
from text_dedup.config.algorithms.simhash import compute, Permutation, Mask
from text_dedup.simhash import fingerprint
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text_col | list[str] | Yes | Batch of text strings (batch_size=1) |
| idx_col | list[int] | Yes | Batch of document indices |
Outputs
| Name | Type | Description |
|---|---|---|
| __key__ | list[tuple[bytes, bytes]] | Per-permutation bucket keys (mask_bytes, permuted_prefix_bytes) |
| __val__ | list[bytes] | Raw SimHash signature bytes (repeated per permutation) |
| __INDEX__ | list[int] | Document index (repeated per permutation) |
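The output layout can be pictured with a small sketch: one document expands into one row per permutation, with the signature and index repeated. The helper expand_rows and the sample byte values are hypothetical and only illustrate the shape of the three columns, not how the library builds them.

```python
def expand_rows(
    keys: list[tuple[bytes, bytes]], signature: bytes, idx: int
) -> dict[str, list]:
    """Repeat the signature and document index once per bucket key."""
    n = len(keys)
    return {
        "__key__": keys,
        "__val__": [signature] * n,
        "__INDEX__": [idx] * n,
    }


# Made-up keys for two permutations of a 16-bit signature.
rows = expand_rows(
    keys=[(b"\xff\x00", b"\xab\x00"), (b"\x00\xff", b"\x00\xcd")],
    signature=b"\xab\xcd",
    idx=7,
)
```

All three lists have the same length, so each position describes one (bucket key, signature, document) row.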
Usage Examples
Fingerprinting a Dataset
from text_dedup.config.base import load_config_from_toml
from text_dedup.simhash import fingerprint, load_and_preprocess
from pathlib import Path
config = load_config_from_toml(Path("configs/simhash.toml"))
ds, original_len = load_and_preprocess(config)
# Generate SimHash fingerprints with bucket keys
embedded = fingerprint(config, ds)
print(embedded.column_names) # ['__key__', '__val__', '__INDEX__']