# Heuristic: NVIDIA NeMo Curator Deduplication Blocksize Tuning
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Optimization, Performance |
| Last Updated | 2026-02-14 16:45 GMT |
## Overview
Use `input_blocksize="2GiB"` for exact deduplication and `input_blocksize="1GiB"` for fuzzy deduplication to balance throughput and memory usage on GPU-accelerated pipelines.
## Description
The `input_blocksize` parameter controls how input data files are partitioned for parallel processing. Larger blocksizes mean fewer but larger partitions, which reduces scheduling overhead but increases per-worker memory usage. Smaller blocksizes produce more partitions with lower per-worker memory usage but more scheduling overhead. The optimal blocksize differs between exact and fuzzy deduplication due to their different memory profiles.
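To make the trade-off concrete, here is a small self-contained sketch (not NeMo Curator code; the helper names are hypothetical) showing how the blocksize determines the partition count for a given dataset size, assuming partitions are roughly blocksize-sized:

```python
import math

def parse_blocksize(spec: "str | int") -> int:
    """Parse a size string like '2GiB' or '512MiB' into bytes; ints pass through."""
    if isinstance(spec, int):
        return spec
    units = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
    for suffix, factor in units.items():
        if spec.endswith(suffix):
            return int(float(spec[: -len(suffix)]) * factor)
    raise ValueError(f"Unrecognized blocksize: {spec!r}")

def num_partitions(total_bytes: int, blocksize: "str | int") -> int:
    """Fewer, larger partitions -> less scheduling overhead, more memory per worker."""
    return math.ceil(total_bytes / parse_blocksize(blocksize))
```

For example, a 100 GiB dataset yields roughly 50 partitions at `"2GiB"` but 100 at `"1GiB"`, doubling scheduling work while halving per-worker memory pressure.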
## Usage
Apply this heuristic when configuring `ExactDeduplicationWorkflow` or `FuzzyDeduplicationWorkflow`. If you encounter OOM errors during deduplication, reduce the blocksize. If throughput is too low and memory is available, increase the blocksize.
## The Insight (Rule of Thumb)
- Action: Set `input_blocksize` appropriately for each deduplication type.
- Values:
- Exact deduplication: `input_blocksize="2GiB"` (default)
- Fuzzy deduplication: `input_blocksize="1GiB"` (default)
- Trade-off: Larger blocksize increases throughput but requires more GPU memory per worker. Smaller blocksize reduces OOM risk but increases scheduling overhead.
- Additional tip: Always clear the cache/output directory between runs to avoid stale data contaminating results.
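The rule of thumb above can be sketched as a small helper (hypothetical, not part of NeMo Curator's API), assuming the simple backoff policy of halving the blocksize after each observed OOM:

```python
def recommended_blocksize(mode: str, oom_retries: int = 0) -> str:
    """Start from the per-mode default and halve it once per observed OOM.

    mode: "exact" (2 GiB default) or "fuzzy" (1 GiB default).
    """
    base_gib = {"exact": 2, "fuzzy": 1}[mode]
    mib = (base_gib * 1024) >> oom_retries  # halve (in MiB) per retry
    return f"{mib}MiB"
```

A run that OOMs once under fuzzy dedup would thus retry with `"512MiB"` instead of the `"1GiB"` default.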
## Reasoning
Exact deduplication is relatively lightweight per-record (hashing-based), so larger partitions can be processed without excessive memory pressure. Fuzzy deduplication involves MinHash signature computation and LSH, which are more memory-intensive due to intermediate data structures (hash tables, bucket arrays). Hence the smaller default blocksize.
From `nemo_curator/stages/deduplication/exact/workflow.py:55`:

    input_blocksize: str | int = "2GiB",
From `nemo_curator/stages/deduplication/fuzzy/workflow.py:69`:

    input_blocksize: str | int = "1GiB",
Performance also depends on `bands_per_iteration` and `char_ngrams` for fuzzy dedup. More bands per iteration means fewer passes over the data but higher memory usage per pass.
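The passes-versus-memory relationship for `bands_per_iteration` can be illustrated with a one-line sketch (assumed semantics: LSH bands are processed in groups, one pass over the data per group):

```python
import math

def lsh_passes(num_bands: int, bands_per_iteration: int) -> int:
    """More bands per iteration -> fewer passes over the data, more memory per pass."""
    return math.ceil(num_bands / bands_per_iteration)
```

For instance, 20 bands processed 5 at a time require 4 passes; processed 3 at a time, 7 passes, trading lower peak memory for more I/O.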