
Implementation:NVIDIA NeMo Curator ParquetWriter

From Leeroopedia
Knowledge Sources
Domains Data_Curation, NLP, Data_Engineering
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by NeMo Curator, for writing curated document batches to Apache Parquet files.

Description

The ParquetWriter is a dataclass-based writer stage that serializes DocumentBatch DataFrames to Parquet format using pandas. Additional writers include JsonlWriter for JSONL output and MegatronTokenizerWriter for pre-tokenized Megatron-LM binary format (.bin + .idx files).

Usage

Import this writer when exporting curated text data to Parquet format. Use JsonlWriter for JSONL output or MegatronTokenizerWriter for Megatron-LM training.

Code Reference

Source Location

  • Repository: NeMo Curator
  • File: nemo_curator/stages/text/io/writer/parquet.py
  • Lines: L23-44

Signature

@dataclass
class ParquetWriter(BaseWriter):
    """Writer that writes a DocumentBatch to a Parquet file using pandas."""
    path: str = None
    fields: list[str] = None
    mode: Literal["ignore", "overwrite", "append", "error"] = "ignore"
    write_kwargs: dict = field(default_factory=dict)
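The `mode` field controls what happens when the output path already exists. The authoritative behavior lives in the writer's base class; the following is a minimal stdlib sketch of the conventional semantics these four mode names usually carry (an illustrative assumption, not code taken from the NeMo Curator source):

```python
import os

def resolve_write_mode(path: str, mode: str) -> bool:
    """Illustrative sketch of conventional writer-mode semantics.

    Returns True if writing should proceed, False if it should be
    skipped. The actual NeMo Curator logic may differ in detail.
    """
    if not os.path.exists(path):
        return True
    if mode == "error":
        raise FileExistsError(f"{path} already exists")
    if mode == "ignore":
        return False  # leave existing output untouched (the default)
    if mode == "overwrite":
        return True   # caller replaces the existing files
    if mode == "append":
        return True   # caller adds new files alongside the old ones
    raise ValueError(f"unknown mode: {mode}")
```

Note that `"ignore"` is the default, so re-running a pipeline against an existing output directory silently skips the write unless you opt into another mode.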

Import

from nemo_curator.stages.text.io.writer.parquet import ParquetWriter
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter
from nemo_curator.stages.text.io.writer.megatron_tokenizer import MegatronTokenizerWriter

I/O Contract

Inputs

Name   Type            Required   Description
task   DocumentBatch   Yes        DataFrame with text and metadata columns

Outputs

Name    Type            Description
files   FileGroupTask   Paths to the written Parquet files

Usage Examples

Basic Parquet Export

from nemo_curator.stages.text.io.writer.parquet import ParquetWriter

writer = ParquetWriter(
    path="./output/curated_data",
    fields=["text", "url", "language", "quality_pred"],
    mode="overwrite",
)
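Since ParquetWriter serializes with pandas, setting `fields` restricts which columns reach the output file; columns not listed are dropped. Assuming `fields` simply subsets the underlying DataFrame (an inference from the configuration above, not a quote of the writer's internals), the selection is equivalent to:

```python
import pandas as pd

# Toy DocumentBatch-style DataFrame; column names mirror the example above.
df = pd.DataFrame({
    "text": ["hello", "world"],
    "url": ["https://a.example", "https://b.example"],
    "language": ["en", "en"],
    "quality_pred": [0.91, 0.88],
    "raw_html": ["<p>hello</p>", "<p>world</p>"],  # not in fields: dropped
})

fields = ["text", "url", "language", "quality_pred"]
subset = df[fields]          # only the listed columns survive
print(list(subset.columns))  # ['text', 'url', 'language', 'quality_pred']
```

Leaving `fields` unset writes every column in the batch, so listing fields explicitly is the easy way to keep bulky intermediate columns out of the exported Parquet files.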

Megatron Tokenizer Export

from nemo_curator.stages.text.io.writer.megatron_tokenizer import MegatronTokenizerWriter

writer = MegatronTokenizerWriter(
    path="./output/megatron_data",
    model_identifier="nvidia/megatron-gpt2-345m",
    text_field="text",
    tokenization_batch_size=1000,
    append_eod=True,
)
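MegatronTokenizerWriter emits Megatron-LM's indexed-dataset format: token ids in a `.bin` file with a companion `.idx` index, and training expects both halves of each pair to be present. A small stdlib sketch (illustrative only; the helper name and the check itself are ours, not part of NeMo Curator) for spotting incomplete pairs before launching training:

```python
import os
import tempfile

def unpaired_megatron_files(directory: str) -> list[str]:
    """Return stems that have a .bin without a .idx, or vice versa."""
    stems = {}
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        if ext in (".bin", ".idx"):
            stems.setdefault(stem, set()).add(ext)
    return sorted(s for s, exts in stems.items() if exts != {".bin", ".idx"})

# Demo on a throwaway directory: one complete pair, one orphaned .bin.
with tempfile.TemporaryDirectory() as d:
    for fname in ("shard_0.bin", "shard_0.idx", "shard_1.bin"):
        open(os.path.join(d, fname), "w").close()
    print(unpaired_megatron_files(d))  # ['shard_1']
```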

Related Pages

Implements Principle
