Implementation:ChenghaoMou Text dedup MinHash Get Embed Func

From Leeroopedia
Knowledge Sources
Domains Hashing, NLP, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by MinHashAlgorithmConfig, for generating MinHash band signatures from document text.

Description

The get_embed_func method on MinHashAlgorithmConfig returns a closure that: (1) tokenizes text into byte n-grams, (2) hashes each n-gram with xxh3 or sha1, (3) applies universal hash permutations via vectorized numpy operations, (hashvalues * a + b) % prime & max_hash, (4) takes column-wise minimums to produce the MinHash signature, and (5) splits the signature into bands, returning one row per band with __band_idx__, __band_val__ (bytes), and __INDEX__.
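The five steps above can be illustrated with a minimal sketch. It is not the text-dedup implementation: it uses plain Python loops instead of vectorized numpy, only sha1 (truncated to 32 bits), and a hypothetical factory named make_embed_func. The prime, max_hash, seed, n-gram size, and byte-packing format are all assumptions chosen for illustration.

```python
import hashlib
import random
import struct

# Assumed constants for illustration; the real config may use other values.
PRIME = (1 << 61) - 1      # modulus for the universal hash family
MAX_HASH = (1 << 32) - 1   # mask that keeps hash values within 32 bits


def byte_ngrams(text: str, n: int) -> set[bytes]:
    """Step 1: split the UTF-8 bytes of the text into overlapping n-grams."""
    data = text.encode("utf-8")
    if len(data) < n:
        return {data} if data else set()
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def sha1_hash32(data: bytes) -> int:
    """Step 2: hash an n-gram with sha1, keeping the first 4 bytes."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def make_embed_func(num_perm=8, num_bands=4, ngram_size=3, seed=42):
    """Hypothetical stand-in for get_embed_func: returns a closure."""
    rng = random.Random(seed)
    # One (a, b) pair per permutation of the universal hash family.
    perms = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
             for _ in range(num_perm)]
    rows = num_perm // num_bands
    hash_ranges = [(i * rows, (i + 1) * rows) for i in range(num_bands)]

    def embed(texts: list[str], idxs: list[int]) -> dict[str, list]:
        text, idx = texts[0], idxs[0]  # batch_size=1: one document per call
        hashes = [sha1_hash32(g) for g in byte_ngrams(text, ngram_size)]
        # Steps 3 and 4: permute every hash, then keep the minimum per permutation.
        signature = [
            min((((h * a + b) % PRIME) & MAX_HASH for h in hashes),
                default=MAX_HASH)
            for a, b in perms
        ]
        # Step 5: cut the signature into bands and serialize each band to bytes.
        bands = [struct.pack(f"<{end - start}I", *signature[start:end])
                 for start, end in hash_ranges]
        return {
            "__band_idx__": list(range(len(hash_ranges))),
            "__band_val__": bands,
            "__INDEX__": [idx] * len(hash_ranges),
        }

    return embed
```

Calling make_embed_func()(["some text"], [0]) in this sketch returns three equal-length lists, one entry per band, which matches the one-row-per-band shape described above.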

The fingerprint function in minhash.py calls Dataset.map with this embed function in batched mode (batch_size=1), so each call receives a single document and returns one row per band.
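Batched mode matters here because a batched map function may return a different number of rows than it receives, which is what lets one document expand into several band rows. The sketch below imitates that behavior with plain dictionaries of lists; batched_map and toy_embed are hypothetical helpers, the "text" column name is an assumption, and nothing here is the Hugging Face datasets implementation.

```python
def toy_embed(texts, idxs, num_bands=3):
    """Hypothetical placeholder embed function: one fake row per band."""
    return {
        "__band_idx__": list(range(num_bands)),
        "__band_val__": [f"{texts[0]}:{b}".encode() for b in range(num_bands)],
        "__INDEX__": [idxs[0]] * num_bands,
    }


def batched_map(columns, func, batch_size=1):
    """Simplified stand-in for a batched map over a column-oriented dataset."""
    out = {}
    total = len(columns["text"])
    for start in range(0, total, batch_size):
        stop = start + batch_size
        result = func(columns["text"][start:stop], columns["__INDEX__"][start:stop])
        # Each call may yield more rows than it consumed; append them all.
        for key, values in result.items():
            out.setdefault(key, []).extend(values)
    return out


docs = {"text": ["first doc", "second doc"], "__INDEX__": [0, 1]}
embedded = batched_map(docs, toy_embed, batch_size=1)
```

With two input documents and three bands, embedded holds six rows, and the original text column is replaced by the three output columns.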

Usage

Use this when executing the MinHash fingerprinting step of the MinHash LSH deduplication pipeline, after configuration loading and data preprocessing.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/config/algorithms/minhash.py
  • Lines: L200-238

Signature

class MinHashAlgorithmConfig(AlgorithmConfig):
    def get_embed_func(
        self,
    ) -> Callable[[list[str], list[int]], dict[str, list[int | bytes]]]:
        """Create a function that embeds a string into a list of
        (band_idx, band_val, index) tuples.

        Returns
        -------
        Callable
            Closure capturing permutations, hash_func, hash_ranges, ngrams_func.
        """

Import

from text_dedup.config import MinHashAlgorithmConfig
from text_dedup.minhash import fingerprint

I/O Contract

Inputs

Name Type Required Description
text_col list[str] Yes Batch of text strings (batch_size=1, so single element)
idx_col list[int] Yes Batch of document indices

Outputs

Name Type Description
__band_idx__ list[int] Band index for each band (0 to num_bands-1)
__band_val__ list[bytes] Band signature bytes for each band
__INDEX__ list[int] Document index repeated for each band
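To show why the output carries a band index, band bytes, and a repeated document index, here is a hedged sketch of how such rows could be grouped into candidate duplicate buckets in a later LSH step. The helper candidate_buckets is hypothetical and is not part of text-dedup; the sample rows are made up for illustration.

```python
from collections import defaultdict


def candidate_buckets(rows):
    """Group document indices that share the same (band index, band bytes) key."""
    buckets = defaultdict(set)
    for band_idx, band_val, doc_idx in zip(
        rows["__band_idx__"], rows["__band_val__"], rows["__INDEX__"]
    ):
        buckets[(band_idx, band_val)].add(doc_idx)
    # Only buckets holding more than one document suggest a possible duplicate.
    return {key: docs for key, docs in buckets.items() if len(docs) > 1}


# Made-up rows: documents 0 and 1 agree on band 0 only.
rows = {
    "__band_idx__": [0, 1, 0, 1],
    "__band_val__": [b"aa", b"bb", b"aa", b"cc"],
    "__INDEX__": [0, 0, 1, 1],
}
candidates = candidate_buckets(rows)
```

Including the band index in the key keeps identical byte values from different bands apart, so documents are paired only when they agree on the same slice of the signature.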

Usage Examples

Fingerprinting a Dataset

from text_dedup.config.base import load_config_from_toml
from text_dedup.minhash import fingerprint, load_and_preprocess
from pathlib import Path

config = load_config_from_toml(Path("configs/minhash.toml"))
ds, original_len, filtered_len = load_and_preprocess(config)

# Generate MinHash band signatures
embedded = fingerprint(config, ds)
print(embedded.column_names)  # ['__band_idx__', '__band_val__', '__INDEX__']

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
