# Heuristic: NVIDIA NeMo Curator Deduplication Blocksize Tuning
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Optimization, Performance |
| Last Updated | 2026-02-14 16:45 GMT |
## Overview
Use `input_blocksize="2GiB"` for exact deduplication and `input_blocksize="1GiB"` for fuzzy deduplication to balance throughput and memory usage on GPU-accelerated pipelines.
## Description
The `input_blocksize` parameter controls how input data files are partitioned for parallel processing. Larger blocksizes mean fewer but larger partitions, which reduces scheduling overhead but increases per-worker memory usage. Smaller blocksizes produce more partitions with lower per-worker memory usage but more scheduling overhead. The optimal blocksize differs between exact and fuzzy deduplication due to their different memory profiles.
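To make the trade-off concrete, here is a small self-contained sketch (not NeMo Curator code; the helper names are hypothetical) showing how the blocksize determines the partition count for a given dataset size, assuming partitions are roughly blocksize-sized:

```python
import math

def parse_blocksize(spec: "str | int") -> int:
    """Parse a size string like '2GiB' or '512MiB' into bytes; ints pass through."""
    if isinstance(spec, int):
        return spec
    units = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
    for suffix, factor in units.items():
        if spec.endswith(suffix):
            return int(float(spec[: -len(suffix)]) * factor)
    raise ValueError(f"Unrecognized blocksize: {spec!r}")

def num_partitions(total_bytes: int, blocksize: "str | int") -> int:
    """Fewer, larger partitions -> less scheduling overhead, more memory per worker."""
    return math.ceil(total_bytes / parse_blocksize(blocksize))
```

For example, a 100 GiB dataset yields roughly 50 partitions at `"2GiB"` but 100 at `"1GiB"`, doubling scheduling work while halving per-worker memory pressure.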
## Usage
Apply this heuristic when configuring `ExactDeduplicationWorkflow` or `FuzzyDeduplicationWorkflow`. If you encounter OOM errors during deduplication, reduce the blocksize. If throughput is too low and memory is available, increase the blocksize.
## The Insight (Rule of Thumb)
- Action: Set `input_blocksize` appropriately for each deduplication type.
- Values:
- Exact deduplication: `input_blocksize="2GiB"` (default)
- Fuzzy deduplication: `input_blocksize="1GiB"` (default)
- Trade-off: Larger blocksize increases throughput but requires more GPU memory per worker. Smaller blocksize reduces OOM risk but increases scheduling overhead.
- Additional tip: Always clear the cache/output directory between runs to avoid stale data contaminating results.
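The rule of thumb above can be sketched as a small helper (hypothetical, not part of NeMo Curator's API), assuming the simple backoff policy of halving the blocksize after each observed OOM:

```python
def recommended_blocksize(mode: str, oom_retries: int = 0) -> str:
    """Start from the per-mode default and halve it once per observed OOM.

    mode: "exact" (2 GiB default) or "fuzzy" (1 GiB default).
    """
    base_gib = {"exact": 2, "fuzzy": 1}[mode]
    mib = (base_gib * 1024) >> oom_retries  # halve (in MiB) per retry
    return f"{mib}MiB"
```

A run that OOMs once under fuzzy dedup would thus retry with `"512MiB"` instead of the `"1GiB"` default.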
## Reasoning
Exact deduplication is relatively lightweight per-record (hashing-based), so larger partitions can be processed without excessive memory pressure. Fuzzy deduplication involves MinHash signature computation and LSH, which are more memory-intensive due to intermediate data structures (hash tables, bucket arrays). Hence the smaller default blocksize.
From `nemo_curator/stages/deduplication/exact/workflow.py:55`:

    input_blocksize: str | int = "2GiB",
From `nemo_curator/stages/deduplication/fuzzy/workflow.py:69`:

    input_blocksize: str | int = "1GiB",
Performance also depends on `bands_per_iteration` and `char_ngrams` for fuzzy dedup. More bands per iteration means fewer passes over the data but higher memory usage per pass.
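The passes-versus-memory relationship for `bands_per_iteration` can be illustrated with a one-line sketch (assumed semantics: LSH bands are processed in groups, one pass over the data per group):

```python
import math

def lsh_passes(num_bands: int, bands_per_iteration: int) -> int:
    """More bands per iteration -> fewer passes over the data, more memory per pass."""
    return math.ceil(num_bands / bands_per_iteration)
```

For instance, 20 bands processed 5 at a time require 4 passes; processed 3 at a time, 7 passes, trading lower peak memory for more I/O.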