Heuristic: ChenghaoMou text-dedup Fingerprint Batch Size One
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Deduplication, HuggingFace_Datasets |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
MinHash and SimHash fingerprinting operations use `batched=True, batch_size=1` to ensure each document is processed independently through the embedding function.
Description
Both MinHash and SimHash embedding functions are designed to process exactly one document at a time even though HuggingFace Datasets' `.map()` is called with `batched=True`. The `batch_size=1` parameter ensures each call receives a list of length 1. This is an architectural choice: the embedding functions accept `list[str]` (as required by the batched API) but always index `text_col[0]` to process a single document. The batched API is used because the output contains multiple rows per input document (one per band/permutation).
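A minimal sketch of the fan-out pattern described above, with hypothetical names (`embed_func`, `num_bands`, the `hash`-based signature are illustrative stand-ins, not the project's actual MinHash code): the function receives a single-element batch and returns more rows than it received, one per band, which is only legal under the batched `.map()` API.

```python
def embed_func(text_col: list[str], idx_col: list[int]) -> dict[str, list]:
    """Fingerprint exactly one document into one output row per band."""
    content = text_col[0]  # batch_size=1 guarantees a single document here
    idx = idx_col[0]
    num_bands = 4          # illustrative; real configs use many more bands
    out: dict[str, list] = {"__index__": [], "band": [], "signature": []}
    for band in range(num_bands):
        out["__index__"].append(idx)           # same document index on every row
        out["band"].append(band)
        out["signature"].append(hash((content, band)))  # stand-in for a real band hash
    return out
```

One input document thus yields `num_bands` output rows, each carrying the document's index so the rows can later be grouped back together.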
Usage
Apply this heuristic when modifying the MinHash or SimHash fingerprinting pipeline. Do not increase `batch_size` above 1 for the embedding map operation. The embedding functions are structured to produce fan-out output (multiple rows per input document), which requires the batched API with a single-element batch.
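The failure mode if `batch_size` were raised is silent data loss, not an error. A toy illustration (hypothetical `embed_one`, mimicking the `text_col[0]` access pattern quoted in Code Evidence below):

```python
def embed_one(text_col: list[str]) -> dict[str, list]:
    """Mimics the embed functions: only the first batch element is ever read."""
    doc = text_col[0]            # any further documents in the batch are ignored
    return {"doc": [doc, doc]}   # fan-out: two output rows for this one document

# What a batch_size=2 call would pass in:
out = embed_one(["first", "second"])
# "second" never appears in the output -- it is dropped without any warning
```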
The Insight (Rule of Thumb)
- Action: Always use `batched=True, batch_size=1` for fingerprinting `.map()` calls.
- Value: `batch_size=1` (not configurable via the standard `batch_size` config parameter).
- Trade-off: Slightly higher per-batch overhead than larger batches, but enables the one-to-many output pattern required by the algorithm.
- Why batched=True: The batched API allows returning a dict with lists of different lengths than the input, which is needed because each document produces multiple band/permutation rows.

Reasoning
MinHash produces one row per band (e.g., 50 bands = 50 output rows per document). SimHash produces one row per permutation. HuggingFace Datasets' non-batched `.map()` expects a 1:1 input-to-output mapping. The batched API relaxes this constraint, allowing the function to return lists of any length. Setting `batch_size=1` combines the fan-out capability of the batched API with the simplicity of single-document processing.
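The contract can be demonstrated without `datasets` at all. The sketch below (hypothetical `fake_batched_map` and `three_rows_per_doc`; not library code) simulates what the batched `.map()` API does: call the function per batch, then concatenate the output columns, so per-batch output length is unconstrained.

```python
def fake_batched_map(docs: list[str], func, batch_size: int = 1) -> dict[str, list]:
    """Apply `func` to single-element batches and concatenate its output columns."""
    merged: dict[str, list] = {}
    for start in range(0, len(docs), batch_size):
        batch_out = func(docs[start:start + batch_size])
        for col, values in batch_out.items():
            merged.setdefault(col, []).extend(values)
    return merged

def three_rows_per_doc(text_col: list[str]) -> dict[str, list]:
    # Stand-in for a banded fingerprint function: 3 output rows per document.
    doc = text_col[0]
    return {"doc": [doc] * 3, "band": [0, 1, 2]}

result = fake_batched_map(["a", "b"], three_rows_per_doc)
# 2 documents x 3 bands -> 6 output rows
```

A non-batched `.map()` could not express this, because it must return exactly one output row per input row.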
Code Evidence
MinHash fingerprinting from `src/text_dedup/minhash.py:36-44`:
result: Dataset = ds.map(
    function=algo.get_embed_func(),
    input_columns=[algo.text_column, algo.internal_index_column],
    remove_columns=[col for col in ds.column_names if col != algo.internal_index_column],
    num_proc=config.algorithm.num_proc,
    batched=True,
    batch_size=1,
    desc="Fingerprinting...",
)
SimHash fingerprinting from `src/text_dedup/simhash.py:30-39`:
embedded: Dataset = ds.map(
    function=algo.get_embed_func(),
    input_columns=[algo.text_column, algo.internal_index_column],
    remove_columns=[col for col in ds.column_names if col != algo.internal_index_column],
    num_proc=algo.num_proc,
    with_indices=False,
    batched=True,
    batch_size=1,
    desc="SimHashing...",
)
The embed function always accesses `text_col[0]` for single-document processing, e.g., `src/text_dedup/config/algorithms/minhash.py:215-216`:
content: str = text_col[0]
idx: int = idx_col[0]