
Implementation:NVIDIA NeMo Curator FilePartitioningStage

From Leeroopedia
Implementation Metadata
  • Domains: Data_Curation, Distributed_Computing
  • Implements: NVIDIA_NeMo_Curator_File_Partitioning
  • Last Updated: 2026-02-14 17:00 GMT

Overview

FilePartitioningStage is the NeMo Curator processing stage that distributes input files into balanced groups for parallel processing, partitioning either by file count or by cumulative byte size.

Description

FilePartitioningStage is a dataclass-based ProcessingStage that accepts an _EmptyTask sentinel as input (since it is typically the first stage in a pipeline) and produces a list of FileGroupTask objects. Each output task contains a list of file paths representing one partition of work.

The stage supports two mutually exclusive partitioning strategies:

  • files_per_partition — A fixed number of files per group.
  • blocksize — A target byte size per group (accepts human-readable strings like "1GiB").

Additional features include file extension filtering, a limit parameter for capping the total number of files, and storage_options for accessing files from remote filesystems (e.g., S3, GCS).
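To make the blocksize strategy concrete, the following is a minimal sketch of greedy size-based grouping: files are accumulated into a partition until its cumulative byte size reaches the target, then a new partition is started. This illustrates the idea only; it is not NeMo Curator's actual partitioning algorithm, and the function name and signature are hypothetical.

```python
def group_by_blocksize(paths: list[str], sizes: list[int], blocksize: int) -> list[list[str]]:
    """Greedily pack files into groups of roughly `blocksize` bytes.

    Illustrative sketch only -- NeMo Curator's real implementation may
    balance groups differently (e.g., sorting by size first).
    """
    groups: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for path, size in zip(paths, sizes):
        current.append(path)
        current_bytes += size
        if current_bytes >= blocksize:
            # Target reached: close this group and start a new one
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        # Flush the final, possibly undersized group
        groups.append(current)
    return groups
```

With a 1000-byte target and files of 600, 600, 300, and 300 bytes, this yields two groups of two files each.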

Usage

from nemo_curator.stages.file_partitioning import FilePartitioningStage

# Create stage with size-based partitioning
partitioning_stage = FilePartitioningStage(
    file_paths="/data/raw_corpus/",
    blocksize="1GiB",
    file_extensions=[".jsonl", ".json"],
    limit=10000,
    name="file_partitioning",
)

# Execute the stage (typically via a pipeline runner, which supplies
# the _EmptyTask sentinel here named `empty_task`)
tasks = partitioning_stage.process(empty_task)
for task in tasks:
    print(f"Partition with {len(task.data)} files")

Code Reference

Source Location

nemo_curator/stages/file_partitioning.py, lines 32–302.

Signature

@dataclass
class FilePartitioningStage(ProcessingStage[_EmptyTask, FileGroupTask]):
    file_paths: str | list[str]
    files_per_partition: int | None = None
    blocksize: int | str | None = None
    file_extensions: list[str] | None = None
    storage_options: dict[str, Any] | None = None
    limit: int | None = None
    name: str = "file_partitioning"

Import

from nemo_curator.stages.file_partitioning import FilePartitioningStage

I/O Contract

  • Input: _EmptyTask — Sentinel empty task (this stage is a pipeline entry point)
  • Output: list[FileGroupTask] — Each FileGroupTask.data contains the list of file paths for that partition
  • Parameter file_paths — A directory path or an explicit list of file paths to partition
  • Parameter files_per_partition — Number of files per group (mutually exclusive with blocksize)
  • Parameter blocksize — Target byte size per group, e.g., "1GiB" (mutually exclusive with files_per_partition)
  • Parameter file_extensions — Optional list of file extensions to keep (e.g., [".jsonl"])
  • Parameter limit — Optional cap on the total number of files to process

Usage Examples

Example 1: Count-based partitioning for a local directory

from nemo_curator.stages.file_partitioning import FilePartitioningStage

stage = FilePartitioningStage(
    file_paths="/data/common_crawl/",
    files_per_partition=100,
    file_extensions=[".jsonl"],
)

Example 2: Size-based partitioning with remote storage

from nemo_curator.stages.file_partitioning import FilePartitioningStage

stage = FilePartitioningStage(
    file_paths="s3://my-bucket/corpus/",
    blocksize="512MiB",
    file_extensions=[".parquet"],
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"},
)

Example 3: Limited partitioning for development

from nemo_curator.stages.file_partitioning import FilePartitioningStage

stage = FilePartitioningStage(
    file_paths="/data/corpus/",
    files_per_partition=10,
    limit=50,  # Only process first 50 files
)
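Under the settings above, the expected task count follows from simple arithmetic: capping the input at 50 files and grouping 10 per partition yields 5 FileGroupTask objects. A one-line illustration (the variable names are ours, not the library's):

```python
import math

# limit=50 caps the file list; files_per_partition=10 groups them,
# so the stage would emit ceil(50 / 10) partitions.
num_files = 50
files_per_partition = 10
num_tasks = math.ceil(num_files / files_per_partition)
```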
