Implementation:ChenghaoMou Text dedup SimHash Get Embed Func

Knowledge Sources
Domains Hashing, NLP, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by SimHashAlgorithmConfig, for generating SimHash fingerprints and bit-permutation bucket keys from document text.

Description

The get_embed_func method on SimHashAlgorithmConfig returns a closure that: (1) tokenizes text into byte n-grams, (2) hashes each n-gram with xxh3 into a bitarray, (3) calls compute() to sum the weighted bit vectors and threshold them into the SimHash fingerprint, (4) applies each pre-computed bit Permutation to generate bucket keys (search_mask bytes + permuted prefix bytes), and (5) returns __key__ (a list of (mask, prefix) tuples), __val__ (the raw signature bytes), and __INDEX__ (the document index), with __val__ and __INDEX__ repeated once per permutation.
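
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: hashlib.blake2b stands in for xxh3, Python integers stand in for bitarray, the n-gram size and 64-bit width are arbitrary, and masking the fingerprint in place stands in for the permuted prefix. The names byte_ngrams, token_hash, simhash, and embed are hypothetical.

```python
import hashlib

F = 64  # fingerprint width in bits (assumed for this sketch)


def byte_ngrams(text: str, n: int = 3) -> list[bytes]:
    """Step 1: sliding byte n-grams over the UTF-8 encoding of the text."""
    data = text.encode("utf-8")
    if len(data) <= n:
        return [data]
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def token_hash(token: bytes) -> int:
    """Step 2: hash one n-gram to F bits (blake2b is a stand-in for xxh3)."""
    digest = hashlib.blake2b(token, digest_size=F // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, n: int = 3) -> int:
    """Step 3: add +1/-1 per bit position, then keep the positive positions."""
    counts = [0] * F
    for tok in byte_ngrams(text, n):
        h = token_hash(tok)
        for bit in range(F):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fp = 0
    for bit in range(F):
        if counts[bit] > 0:
            fp |= 1 << bit
    return fp


def embed(texts: list[str], idxs: list[int], masks: list[int]) -> dict:
    """Steps 4-5: one (mask, prefix) key per mask; value and index repeated."""
    keys, vals, ids = [], [], []
    for text, idx in zip(texts, idxs):
        fp = simhash(text)
        for mask in masks:
            # Simplification: the masked bits stay in place instead of being
            # permuted to the front; equality of keys is unaffected.
            keys.append((mask.to_bytes(F // 8, "big"),
                         (fp & mask).to_bytes(F // 8, "big")))
            vals.append(fp.to_bytes(F // 8, "big"))
            ids.append(idx)
    return {"__key__": keys, "__val__": vals, "__INDEX__": ids}


out = embed(["hello world"], [7], [0xFFFF000000000000, 0x0000FFFF00000000])
print(len(out["__key__"]))  # 2, one key per mask
```

Emitting one row per mask means a later grouping step can bucket documents by __key__ alone and compare the full signatures in __val__ only within a bucket.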

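The Permutation class implements bit-block permutation following the simhash-py architecture: the fingerprint is split into b blocks, and each permutation reorders them so that b-k blocks form the prefix used for bucketing. Two fingerprints within Hamming distance k differ in at most k blocks, so they agree on at least b-k blocks and therefore share a prefix under at least one permutation.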
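
The bucketing guarantee can be checked with a small sketch. It assumes contiguous blocks of near-equal width and uses masks over the original bit positions in place of a real reordering; block_masks, prefix_masks, and bucket_keys are hypothetical names, not part of text-dedup or simhash-py.

```python
from itertools import combinations


def block_masks(f: int, b: int) -> list[int]:
    """Split f bits into b contiguous blocks, most significant block first."""
    base, extra = divmod(f, b)
    masks, hi = [], f
    for i in range(b):
        width = base + (1 if i < extra else 0)
        lo = hi - width
        masks.append(((1 << width) - 1) << lo)
        hi = lo
    return masks


def prefix_masks(f: int, b: int, k: int) -> list[int]:
    """One mask per choice of b-k blocks that act as the bucketing prefix."""
    if not 0 <= k < b:
        raise ValueError("k must satisfy 0 <= k < b")
    out = []
    for chosen in combinations(block_masks(f, b), b - k):
        mask = 0
        for block in chosen:
            mask |= block
        out.append(mask)
    return out


def bucket_keys(fp: int, masks: list[int]) -> set[tuple[int, int]]:
    """Bucket key = (mask, bits of the fingerprint selected by that mask)."""
    return {(m, fp & m) for m in masks}


masks = prefix_masks(64, 4, 3)                 # 4 blocks, tolerate 3 differing bits
a = 0x0123456789ABCDEF
near = a ^ ((1 << 0) | (1 << 20) | (1 << 40))  # 3 bits flipped, in 3 different blocks
far = near ^ (1 << 60)                         # a 4th flip: every block now differs

print(bool(bucket_keys(a, masks) & bucket_keys(near, masks)))  # True
print(bool(bucket_keys(a, masks) & bucket_keys(far, masks)))   # False
```

The number of masks is C(b, b-k), so larger b or k increases the number of rows emitted per document; sharing a bucket only nominates a candidate pair, and the full signatures still need a Hamming-distance check.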

Usage

Use this when executing the SimHash fingerprinting step of the SimHash deduplication pipeline.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/config/algorithms/simhash.py
  • Lines: L328-380 (get_embed_func), L222-253 (compute), L189-219 (_unsigned_hash), L70-161 (Permutation), L24-68 (Mask)

Signature

class SimHashAlgorithmConfig(AlgorithmConfig):
    def get_embed_func(
        self,
    ) -> Callable[[list[str], list[int]], dict[str, list[int | bytes]]]:
        """Create a function that computes SimHash fingerprints with
        bit-permutation bucket keys.

        Returns
        -------
        Callable
            Closure capturing tokenizer, hash_func, permutations.
        """

def compute(hashes: list[bitarray]) -> bitarray:
    """Compute the SimHash of a list of token hashes.

    Parameters
    ----------
    hashes : list[bitarray]
        List of per-token hash bitarrays.

    Returns
    -------
    bitarray
        The aggregated SimHash fingerprint.
    """

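As a worked illustration of the aggregation that compute performs, the sketch below uses lists of 0/1 integers in place of bitarray. The +1/-1 weighting and the rule that ties resolve to 0 are assumptions of this sketch; compute_sketch is a hypothetical name.

```python
def compute_sketch(hashes: list[list[int]]) -> list[int]:
    """Majority vote per bit position over equally sized 0/1 vectors."""
    if not hashes:
        raise ValueError("at least one token hash is required")
    width = len(hashes[0])
    totals = [0] * width
    for h in hashes:
        if len(h) != width:
            raise ValueError("all token hashes must have the same width")
        for i, bit in enumerate(h):
            totals[i] += 1 if bit else -1  # a set bit votes +1, a clear bit -1
    # Threshold: keep the positions where set bits are in the majority.
    return [1 if t > 0 else 0 for t in totals]


token_hashes = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(compute_sketch(token_hashes))  # [1, 0, 1, 1]
```

Because each position is decided by a vote across all tokens, changing a few tokens usually flips only a few fingerprint bits, which is what makes Hamming distance a useful similarity measure here.
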
Import

from text_dedup.config import SimHashAlgorithmConfig
from text_dedup.config.algorithms.simhash import compute, Permutation, Mask
from text_dedup.simhash import fingerprint

I/O Contract

Inputs

Name | Type | Required | Description
text_col | list[str] | Yes | Batch of text strings (batch_size=1)
idx_col | list[int] | Yes | Batch of document indices

Outputs

Name | Type | Description
__key__ | list[tuple[bytes, bytes]] | Per-permutation bucket keys (mask_bytes, permuted_prefix_bytes)
__val__ | list[bytes] | Raw SimHash signature bytes (repeated per permutation)
__INDEX__ | list[int] | Document index (repeated per permutation)

Usage Examples

Fingerprinting a Dataset

from text_dedup.config.base import load_config_from_toml
from text_dedup.simhash import fingerprint, load_and_preprocess
from pathlib import Path

config = load_config_from_toml(Path("configs/simhash.toml"))
ds, original_len = load_and_preprocess(config)

# Generate SimHash fingerprints with bucket keys
embedded = fingerprint(config, ds)
print(embedded.column_names)  # ['__key__', '__val__', '__INDEX__']
