Principle:NVIDIA NeMo Curator File Partitioning
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Distributed_Computing |
| Implemented By | NVIDIA_NeMo_Curator_FilePartitioningStage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
File Partitioning is a technique for distributing input files into balanced groups for parallel processing across distributed workers, ensuring even workload distribution in large-scale data curation pipelines.
Description
File Partitioning divides a collection of input files into groups (partitions) that can be independently processed by distributed workers. Partitioning follows one of two strategies:
- Count-based partitioning — Groups files into partitions of a fixed number of files (e.g., 10 files per partition).
- Size-based partitioning — Groups files into partitions that do not exceed a target byte size (e.g., 1 GiB per partition), ensuring balanced memory utilization across workers.
The stage accepts file paths as input (either a single directory path or an explicit list of file paths), optionally filters by file extension, and produces a list of `FileGroupTask` objects where each task's `.data` field contains the list of file paths assigned to that partition. An optional `limit` parameter restricts the total number of files processed, which is useful for development and testing.
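Count-based partitioning amounts to chunking the file list into consecutive fixed-size groups. The following is an illustrative sketch, not the actual `FilePartitioningStage` internals; `partition_by_count` is a hypothetical helper named here for clarity:

```python
def partition_by_count(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Split `paths` into consecutive groups of at most `files_per_partition` files.

    The final group may be smaller if the file count does not divide evenly.
    """
    return [
        paths[i : i + files_per_partition]
        for i in range(0, len(paths), files_per_partition)
    ]
```

With five files and `files_per_partition=2`, this yields three groups of sizes 2, 2, and 1, each of which would become one work unit.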
Usage
File Partitioning is typically the first stage in any NeMo Curator processing pipeline. It converts a raw set of input files into structured work units that subsequent stages (such as MinHash computation or text filtering) can consume in parallel. By controlling partition granularity, users can tune the tradeoff between parallelism overhead and load balancing.
```python
from nemo_curator.stages.file_partitioning import FilePartitioningStage

# Partition by file count
stage = FilePartitioningStage(
    file_paths="/data/corpus/",
    files_per_partition=50,
    file_extensions=[".jsonl"],
)

# Partition by byte size (1 GiB groups)
stage = FilePartitioningStage(
    file_paths="/data/corpus/",
    blocksize="1GiB",
    file_extensions=[".jsonl"],
)
```
Theoretical Basis
File Partitioning is grounded in the principle of load balancing for distributed computing systems. Naive round-robin distribution of files can lead to severe workload skew when file sizes vary significantly. Size-aware partitioning addresses this by ensuring each worker receives approximately the same total byte volume, which:
- Minimizes straggler effects — No single worker is burdened with a disproportionately large partition while others sit idle.
- Ensures uniform GPU utilization — In GPU-accelerated pipelines, balanced partitions prevent GPU memory overflow on some workers while others underutilize their capacity.
- Reduces pipeline latency — The overall wall-clock time of a parallel stage is determined by the slowest worker; balanced partitions minimize this bottleneck.
The partitioning algorithm iterates through files sorted by size, greedily accumulating files into the current partition until the byte-size threshold is reached, then starts a new partition. This greedy bin-packing approach provides a practical approximation to the NP-hard optimal bin-packing problem while running in O(n) time after sorting.
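The greedy accumulation described above can be sketched as follows. This is a minimal illustration of the technique, not the NeMo Curator implementation; `partition_by_size` and its `(path, size)` input format are assumptions made for the example:

```python
def partition_by_size(files: list[tuple[str, int]], blocksize: int) -> list[list[str]]:
    """Greedily group (path, size_in_bytes) pairs into partitions of at most
    `blocksize` total bytes.

    Files are sorted by size (largest first); each file is appended to the
    current partition until adding another would exceed the threshold, at
    which point a new partition is started. A single file larger than
    `blocksize` still gets its own partition.
    """
    partitions: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_bytes + size > blocksize:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions
```

For example, files of 600, 500, 400, and 300 bytes with a 1000-byte threshold produce partitions of 600, 900 (500 + 400), and 300 bytes: no partition exceeds the threshold, and the spread between the heaviest and lightest worker stays bounded.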
Related Pages
- Implementation:NVIDIA_NeMo_Curator_FilePartitioningStage
- NVIDIA_NeMo_Curator_MinHash_Signature_Computation — Often the next stage after file partitioning in fuzzy deduplication pipelines
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow that uses file partitioning as its entry point