Implementation:NVIDIA NeMo Curator ParquetWriter
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Data_Engineering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool, provided by NeMo Curator, for writing curated document batches to Apache Parquet files.
Description
The ParquetWriter is a dataclass-based writer stage that serializes DocumentBatch DataFrames to Parquet format using pandas. Additional writers include JsonlWriter for JSONL output and MegatronTokenizerWriter for pre-tokenized Megatron-LM binary format (.bin + .idx files).
Usage
Import this writer when exporting curated text data to Parquet format. Use JsonlWriter instead when the output should be JSONL, or MegatronTokenizerWriter when the output should be pre-tokenized data for Megatron-LM training.
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/text/io/writer/parquet.py
- Lines: L23-44
Signature
@dataclass
class ParquetWriter(BaseWriter):
    """Writer that writes a DocumentBatch to a Parquet file using pandas."""

    path: str = None
    fields: list[str] = None
    mode: Literal["ignore", "overwrite", "append", "error"] = "ignore"
    write_kwargs: dict = field(default_factory=dict)
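The signature uses `field(default_factory=dict)` for `write_kwargs` because dataclasses reject a bare mutable default and because each instance should get its own dictionary. The sketch below shows that behavior with a hypothetical `SketchWriter` class that only mirrors the field names above; it does not inherit from `BaseWriter` and does no writing.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class SketchWriter:
    """Hypothetical stand-in that mirrors the fields shown in the signature."""

    path: Optional[str] = None
    fields: Optional[list[str]] = None
    mode: Literal["ignore", "overwrite", "append", "error"] = "ignore"
    # default_factory builds a new dict per instance; a bare `= {}` default
    # is rejected by dataclasses with a ValueError.
    write_kwargs: dict = field(default_factory=dict)


a = SketchWriter(path="./output/a")
b = SketchWriter(path="./output/b", mode="overwrite")

a.write_kwargs["compression"] = "zstd"  # only changes a's dict

print(a.write_kwargs)
print(b.write_kwargs)
```

Note that `Literal` is a type hint only: Python does not validate `mode` at runtime unless the class adds its own check.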
Import
from nemo_curator.stages.text.io.writer.parquet import ParquetWriter
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter
from nemo_curator.stages.text.io.writer.megatron_tokenizer import MegatronTokenizerWriter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| task | DocumentBatch | Yes | DataFrame with text and metadata columns |
Outputs
| Name | Type | Description |
|---|---|---|
| files | FileGroupTask | Paths to written parquet files |
Usage Examples
Basic Parquet Export
from nemo_curator.stages.text.io.writer.parquet import ParquetWriter

writer = ParquetWriter(
    path="./output/curated_data",
    fields=["text", "url", "language", "quality_pred"],
    mode="overwrite",
)
Megatron Tokenizer Export
from nemo_curator.stages.text.io.writer.megatron_tokenizer import MegatronTokenizerWriter

writer = MegatronTokenizerWriter(
    path="./output/megatron_data",
    model_identifier="nvidia/megatron-gpt2-345m",
    text_field="text",
    tokenization_batch_size=1000,
    append_eod=True,
)
Related Pages
Implements Principle