Heuristic: ChenghaoMou text-dedup Fingerprint Batch Size One
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Deduplication, HuggingFace_Datasets |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
MinHash and SimHash fingerprinting operations use `batched=True, batch_size=1` to ensure each document is processed independently through the embedding function.
Description
Both MinHash and SimHash embedding functions are designed to process exactly one document at a time even though HuggingFace Datasets' `.map()` is called with `batched=True`. The `batch_size=1` parameter ensures each call receives a list of length 1. This is an architectural choice: the embedding functions accept `list[str]` (as required by the batched API) but always index `text_col[0]` to process a single document. The batched API is used because the output contains multiple rows per input document (one per band/permutation).
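A minimal sketch of the fan-out pattern described above, with hypothetical names (`embed_func`, `num_bands`, the `hash`-based signature are illustrative stand-ins, not the project's actual MinHash code): the function receives a single-element batch and returns more rows than it received, one per band, which is only legal under the batched `.map()` API.

```python
def embed_func(text_col: list[str], idx_col: list[int]) -> dict[str, list]:
    """Fingerprint exactly one document into one output row per band."""
    content = text_col[0]  # batch_size=1 guarantees a single document here
    idx = idx_col[0]
    num_bands = 4          # illustrative; real configs use many more bands
    out: dict[str, list] = {"__index__": [], "band": [], "signature": []}
    for band in range(num_bands):
        out["__index__"].append(idx)           # same document index on every row
        out["band"].append(band)
        out["signature"].append(hash((content, band)))  # stand-in for a real band hash
    return out
```

One input document thus yields `num_bands` output rows, each carrying the document's index so the rows can later be grouped back together.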
Usage
Apply this heuristic when modifying the MinHash or SimHash fingerprinting pipeline. Do not increase `batch_size` above 1 for the embedding map operation. The embedding functions are structured to produce fan-out output (multiple rows per input document), which requires the batched API with a single-element batch.
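The failure mode if `batch_size` were raised is silent data loss, not an error. A toy illustration (hypothetical `embed_one`, mimicking the `text_col[0]` access pattern quoted in Code Evidence below):

```python
def embed_one(text_col: list[str]) -> dict[str, list]:
    """Mimics the embed functions: only the first batch element is ever read."""
    doc = text_col[0]            # any further documents in the batch are ignored
    return {"doc": [doc, doc]}   # fan-out: two output rows for this one document

# What a batch_size=2 call would pass in:
out = embed_one(["first", "second"])
# "second" never appears in the output -- it is dropped without any warning
```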
The Insight (Rule of Thumb)
- Action: Always use `batched=True, batch_size=1` for fingerprinting `.map()` calls.
- Value: `batch_size=1` (not configurable via the standard `batch_size` config parameter).
- Trade-off: Slightly higher per-batch overhead than larger batches, but enables the one-to-many output pattern required by the algorithm.
- Why batched=True: The batched API allows returning a dict with lists of different lengths than the input, which is needed because each document produces multiple band/permutation rows.

Reasoning
MinHash produces one row per band (e.g., 50 bands = 50 output rows per document). SimHash produces one row per permutation. HuggingFace Datasets' non-batched `.map()` expects a 1:1 input-to-output mapping. The batched API relaxes this constraint, allowing the function to return lists of any length. Setting `batch_size=1` combines the fan-out capability of the batched API with the simplicity of single-document processing.
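The contract can be demonstrated without `datasets` at all. The sketch below (hypothetical `fake_batched_map` and `three_rows_per_doc`; not library code) simulates what the batched `.map()` API does: call the function per batch, then concatenate the output columns, so per-batch output length is unconstrained.

```python
def fake_batched_map(docs: list[str], func, batch_size: int = 1) -> dict[str, list]:
    """Apply `func` to single-element batches and concatenate its output columns."""
    merged: dict[str, list] = {}
    for start in range(0, len(docs), batch_size):
        batch_out = func(docs[start:start + batch_size])
        for col, values in batch_out.items():
            merged.setdefault(col, []).extend(values)
    return merged

def three_rows_per_doc(text_col: list[str]) -> dict[str, list]:
    # Stand-in for a banded fingerprint function: 3 output rows per document.
    doc = text_col[0]
    return {"doc": [doc] * 3, "band": [0, 1, 2]}

result = fake_batched_map(["a", "b"], three_rows_per_doc)
# 2 documents x 3 bands -> 6 output rows
```

A non-batched `.map()` could not express this, because it must return exactly one output row per input row.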
Code Evidence
MinHash fingerprinting from `src/text_dedup/minhash.py:36-44`:
result: Dataset = ds.map(
    function=algo.get_embed_func(),
    input_columns=[algo.text_column, algo.internal_index_column],
    remove_columns=[col for col in ds.column_names if col != algo.internal_index_column],
    num_proc=config.algorithm.num_proc,
    batched=True,
    batch_size=1,
    desc="Fingerprinting...",
)
SimHash fingerprinting from `src/text_dedup/simhash.py:30-39`:
embedded: Dataset = ds.map(
    function=algo.get_embed_func(),
    input_columns=[algo.text_column, algo.internal_index_column],
    remove_columns=[col for col in ds.column_names if col != algo.internal_index_column],
    num_proc=algo.num_proc,
    with_indices=False,
    batched=True,
    batch_size=1,
    desc="SimHashing...",
)
The embed function always accesses `text_col[0]` for single-document processing, e.g., `src/text_dedup/config/algorithms/minhash.py:215-216`:
content: str = text_col[0]
idx: int = idx_col[0]