Heuristic:NVIDIA NeMo Curator Deduplication Blocksize Tuning

From Leeroopedia
Knowledge Sources
Domains: Deduplication, Optimization, Performance
Last Updated: 2026-02-14 16:45 GMT

Overview

Use `input_blocksize="2GiB"` for exact deduplication and `input_blocksize="1GiB"` for fuzzy deduplication to balance throughput and memory usage on GPU-accelerated pipelines.

Description

The `input_blocksize` parameter controls how input data files are partitioned for parallel processing. Larger blocksizes mean fewer but larger partitions, which reduces scheduling overhead but increases per-worker memory usage. Smaller blocksizes produce more partitions with lower per-worker memory usage but more scheduling overhead. The optimal blocksize differs between exact and fuzzy deduplication due to their different memory profiles.
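The partition-count arithmetic behind this trade-off can be sketched in a few lines. This is a back-of-the-envelope estimate (each input block becomes roughly one task), not NeMo Curator's actual partitioning logic:

```python
import math

GiB = 1024 ** 3

def estimate_partitions(total_bytes: int, blocksize_bytes: int) -> int:
    """Rough partition count: one partition per input block."""
    return math.ceil(total_bytes / blocksize_bytes)

# For a hypothetical 100 GiB corpus:
total = 100 * GiB
exact_parts = estimate_partitions(total, 2 * GiB)  # 2GiB blocks -> 50 partitions
fuzzy_parts = estimate_partitions(total, 1 * GiB)  # 1GiB blocks -> 100 partitions
print(exact_parts, fuzzy_parts)  # 50 100
```

Halving the blocksize doubles the number of partitions to schedule, but each worker holds roughly half as much data at once.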

Usage

Apply this heuristic when configuring `ExactDeduplicationWorkflow` or `FuzzyDeduplicationWorkflow`. If you encounter OOM errors during deduplication, reduce the blocksize. If throughput is too low and memory is available, increase the blocksize.
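The "reduce on OOM" advice can be automated with a simple backoff loop. The sketch below is illustrative: `run_workflow` is a hypothetical stand-in for launching the real deduplication workflow, assumed to raise `MemoryError` on OOM:

```python
def run_with_blocksize_backoff(run_workflow, initial_gib: float, min_gib: float = 0.25):
    """Retry a dedup run, halving the blocksize after each OOM, down to a floor.

    `run_workflow` is any callable taking a blocksize string (e.g. "1GiB");
    it is a placeholder for launching the actual workflow.
    """
    gib = initial_gib
    while gib >= min_gib:
        try:
            run_workflow(f"{gib}GiB")
            return gib  # succeeded at this blocksize
        except MemoryError:
            gib /= 2  # smaller partitions, lower per-worker memory
    raise RuntimeError("OOM even at the minimum blocksize; reduce workers or shard input")
```

The reverse direction (raising the blocksize when memory headroom remains) is better done manually, since under-utilized memory does not raise an error you can catch.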

The Insight (Rule of Thumb)

  • Action: Set `input_blocksize` appropriately for each deduplication type.
  • Values:
    • Exact deduplication: `input_blocksize="2GiB"` (default)
    • Fuzzy deduplication: `input_blocksize="1GiB"` (default)
  • Trade-off: Larger blocksize increases throughput but requires more GPU memory per worker. Smaller blocksize reduces OOM risk but increases scheduling overhead.
  • Additional tip: Always clear the cache/output directory between runs to avoid stale data contaminating results.
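Since `input_blocksize` accepts either a string like `"2GiB"` or a raw byte count, it helps to see what the string forms resolve to. The parser below is an illustrative re-implementation for clarity, not the one NeMo Curator uses internally:

```python
def parse_blocksize(value) -> int:
    """Parse a blocksize like "2GiB" or "512MiB" into bytes; ints pass through."""
    if isinstance(value, int):
        return value
    units = {"KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}
    for suffix, mult in units.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * mult)
    return int(value)  # plain number given as a string

print(parse_blocksize("2GiB"))  # 2147483648 (exact dedup default)
print(parse_blocksize("1GiB"))  # 1073741824 (fuzzy dedup default)
```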

Reasoning

Exact deduplication is relatively lightweight per-record (hashing-based), so larger partitions can be processed without excessive memory pressure. Fuzzy deduplication involves MinHash signature computation and LSH, which are more memory-intensive due to intermediate data structures (hash tables, bucket arrays). Hence the smaller default blocksize.

From `nemo_curator/stages/deduplication/exact/workflow.py:55`:

input_blocksize: str | int = "2GiB",

From `nemo_curator/stages/deduplication/fuzzy/workflow.py:69`:

input_blocksize: str | int = "1GiB",

Performance also depends on `bands_per_iteration` and `char_ngrams` for fuzzy dedup. More bands per iteration means fewer passes over the data but higher memory usage per pass.
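The bands-versus-passes relationship above is simple ceiling division: bands are processed in groups of `bands_per_iteration`, and each group requires one pass over the data. The band counts below are illustrative, not library defaults:

```python
import math

def lsh_passes(num_bands: int, bands_per_iteration: int) -> int:
    """Number of LSH passes over the data when bands are processed in groups."""
    return math.ceil(num_bands / bands_per_iteration)

# e.g. a hypothetical 20-band MinHash scheme:
print(lsh_passes(20, 5))   # 4 passes, moderate memory per pass
print(lsh_passes(20, 20))  # 1 pass, highest memory per pass
```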
