Implementation:NVIDIA NeMo Curator FilePartitioningStage
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Distributed_Computing |
| Implements | NVIDIA_NeMo_Curator_File_Partitioning |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
FilePartitioningStage is the NeMo Curator processing stage that distributes input files into balanced groups for parallel processing, partitioning either by file count or by cumulative byte size.
Description
FilePartitioningStage is a dataclass-based ProcessingStage that accepts an _EmptyTask sentinel as input (since it is typically the first stage in a pipeline) and produces a list of FileGroupTask objects. Each output task contains a list of file paths representing one partition of work.
The stage supports two mutually exclusive partitioning strategies:
- files_per_partition — A fixed number of files per group.
- blocksize — A target byte size per group (accepts human-readable strings like "1GiB").
Additional features include file extension filtering, a limit parameter for capping the total number of files, and storage_options for accessing files from remote filesystems (e.g., S3, GCS).
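The two strategies can be illustrated with plain Python. `group_by_count` and `group_by_size` below are hypothetical helpers sketching the general technique (fixed-size chunking versus greedy byte-budget packing); they are not the stage's actual code:

```python
# Illustrative sketches of the two partitioning strategies (assumptions,
# not NeMo Curator's implementation).

def group_by_count(files: list[str], files_per_partition: int) -> list[list[str]]:
    """Split files into fixed-size groups of files_per_partition."""
    return [files[i:i + files_per_partition]
            for i in range(0, len(files), files_per_partition)]

def group_by_size(files: list[tuple[str, int]], blocksize: int) -> list[list[str]]:
    """Greedily pack (path, size) pairs into groups of roughly blocksize bytes."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for path, size in files:
        # Start a new group when adding this file would exceed the budget.
        if current and current_bytes + size > blocksize:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

print(group_by_count(["a", "b", "c", "d", "e"], 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```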
Usage
```python
from nemo_curator.stages.file_partitioning import FilePartitioningStage

# Create stage with size-based partitioning
partitioning_stage = FilePartitioningStage(
    file_paths="/data/raw_corpus/",
    blocksize="1GiB",
    file_extensions=[".jsonl", ".json"],
    limit=10000,
    name="file_partitioning",
)

# Execute the stage. This is typically driven by a pipeline runner, which
# supplies the _EmptyTask sentinel (empty_task) as input.
tasks = partitioning_stage.process(empty_task)
for task in tasks:
    print(f"Partition with {len(task.data)} files")
```
Code Reference
Source Location
nemo_curator/stages/file_partitioning.py, lines 32–302.
Signature
```python
@dataclass
class FilePartitioningStage(ProcessingStage[_EmptyTask, FileGroupTask]):
    file_paths: str | list[str]
    files_per_partition: int | None = None
    blocksize: int | str | None = None
    file_extensions: list[str] | None = None
    storage_options: dict[str, Any] | None = None
    limit: int | None = None
    name: str = "file_partitioning"
```
Import
```python
from nemo_curator.stages.file_partitioning import FilePartitioningStage
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | _EmptyTask | Sentinel empty task (this stage is a pipeline entry point) |
| Output | list[FileGroupTask] | Each FileGroupTask.data contains a list of file paths for that partition |
| Parameters | file_paths | A directory path or explicit list of file paths to partition |
| Parameters | files_per_partition | Number of files per group (mutually exclusive with blocksize) |
| Parameters | blocksize | Target byte size per group, e.g., "1GiB" (mutually exclusive with files_per_partition) |
| Parameters | file_extensions | Optional list of file extensions to filter (e.g., [".jsonl"]) |
| Parameters | storage_options | Optional credentials/options for remote filesystems (e.g., S3, GCS) |
| Parameters | limit | Optional cap on total number of files to process |
Usage Examples
Example 1: Count-based partitioning for a local directory
```python
from nemo_curator.stages.file_partitioning import FilePartitioningStage

stage = FilePartitioningStage(
    file_paths="/data/common_crawl/",
    files_per_partition=100,
    file_extensions=[".jsonl"],
)
```
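The extension filtering used in Example 1 can be illustrated independently of the library. `filter_by_extension` below is a hypothetical sketch of the technique (assumption: the stage keeps only files whose suffix matches one of file_extensions):

```python
def filter_by_extension(paths: list[str], extensions: list[str]) -> list[str]:
    """Keep only paths ending in one of the given extensions."""
    return [p for p in paths if any(p.endswith(ext) for ext in extensions)]

paths = ["a.jsonl", "b.parquet", "c.jsonl", "d.txt"]
print(filter_by_extension(paths, [".jsonl"]))  # ['a.jsonl', 'c.jsonl']
```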
Example 2: Size-based partitioning with remote storage
```python
from nemo_curator.stages.file_partitioning import FilePartitioningStage

stage = FilePartitioningStage(
    file_paths="s3://my-bucket/corpus/",
    blocksize="512MiB",
    file_extensions=[".parquet"],
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"},
)
```
Example 3: Limited partitioning for development
```python
from nemo_curator.stages.file_partitioning import FilePartitioningStage

stage = FilePartitioningStage(
    file_paths="/data/corpus/",
    files_per_partition=10,
    limit=50,  # Only process first 50 files
)
```
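The expected result of Example 3 can be worked out with plain Python: a limit of 50 files at 10 files per partition should yield five groups. The snippet below illustrates that arithmetic only; the file names are made up and no NeMo Curator import is involved:

```python
# Hypothetical file listing, purely for illustrating the limit arithmetic.
all_files = [f"/data/corpus/file_{i:04d}.jsonl" for i in range(200)]

limit = 50
files_per_partition = 10

selected = all_files[:limit]  # limit caps the total files considered
groups = [selected[i:i + files_per_partition]
          for i in range(0, len(selected), files_per_partition)]

print(len(groups))     # 5
print(len(groups[0]))  # 10
```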
Related Pages
- Principle:NVIDIA_NeMo_Curator_File_Partitioning
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- NVIDIA_NeMo_Curator_MinHashStage — Commonly follows FilePartitioningStage in fuzzy deduplication pipelines
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow that orchestrates all deduplication stages