Implementation:NVIDIA NeMo Curator DocumentBatch

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Curation, Text Processing, Pipeline Tasks
Last Updated	2026-02-14 00:00 GMT

Overview

The DocumentBatch class defines the core task type for processing batches of text documents in the NeMo Curator pipeline, storing tabular document data as either a PyArrow Table or Pandas DataFrame.

Description

DocumentBatch extends Task[pa.Table | pd.DataFrame] and serves as the primary data container for text-based data curation workflows. Its to_pyarrow() and to_pandas() methods convert the underlying data between the PyArrow and Pandas representations, so a stage can request whichever format it needs.

Data storage: The data field is declared with default_factory=pa.Table and can hold either a pa.Table or a pd.DataFrame.

Format conversion: The to_pyarrow() method converts the data to a PyArrow Table (passthrough if already PyArrow, conversion via pa.Table.from_pandas() if Pandas). The to_pandas() method converts to a Pandas DataFrame (passthrough if already Pandas, conversion via .to_pandas() if PyArrow). Both methods raise TypeError for unsupported data types.

Column access: The get_columns() method returns column names from either data format, using .columns for DataFrames or .column_names for PyArrow Tables.

Validation: The validate() method checks that the batch has at least one row and at least one column, logging warnings via loguru on failure.

Item counting: The num_items property returns the number of rows (documents) using len(self.data).

Usage

DocumentBatch is the standard task type consumed and produced by virtually all text processing stages in NeMo Curator, including filtering, classification, deduplication, modification, splitting, and joining stages. It is the fundamental data unit flowing through text curation pipelines.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/tasks/document.py
Lines: 1-80

Signature

@dataclass
class DocumentBatch(Task[pa.Table | pd.DataFrame]):
    data: pa.Table | pd.DataFrame = field(default_factory=pa.Table)

    def to_pyarrow(self) -> pa.Table: ...
    def to_pandas(self) -> pd.DataFrame: ...

    @property
    def num_items(self) -> int: ...
    def get_columns(self) -> list[str]: ...
    def validate(self) -> bool: ...

Import

from nemo_curator.tasks.document import DocumentBatch
# or
from nemo_curator.tasks import DocumentBatch

I/O Contract

Inputs

Name	Type	Required	Description
data	pa.Table or pd.DataFrame	No	Tabular document data (default: empty PyArrow Table)
task_id	str	Yes	Unique identifier for this task (inherited from Task)
dataset_name	str	Yes	Name of the dataset this task belongs to (inherited from Task)

Outputs

Name	Type	Description
to_pyarrow()	pa.Table	Document data as a PyArrow Table
to_pandas()	pd.DataFrame	Document data as a Pandas DataFrame
get_columns()	list[str]	Column names from the underlying data
num_items	int	Number of documents (rows) in the batch
validate()	bool	Whether the batch has non-empty data with at least one column

Usage Examples

Creating a DocumentBatch from Pandas

import pandas as pd
from nemo_curator.tasks import DocumentBatch

df = pd.DataFrame({
    "text": ["Hello world", "Foo bar baz"],
    "language": ["en", "en"],
})

batch = DocumentBatch(
    task_id="batch_001",
    dataset_name="my_dataset",
    data=df,
)

print(batch.num_items)       # 2
print(batch.get_columns())   # ['text', 'language']

Format Conversion

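import pandas as pd
from nemo_curator.tasks import DocumentBatch

df = pd.DataFrame({"text": ["Hello world", "Foo bar baz"]})
batch = DocumentBatch(task_id="batch_001", dataset_name="my_dataset", data=df)

# Convert between formats
arrow_table = batch.to_pyarrow()  # converted via pa.Table.from_pandas()
pandas_df = batch.to_pandas()     # passthrough: data is already a DataFrame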

Validation

from nemo_curator.tasks import DocumentBatch
import pandas as pd

# Valid batch
batch = DocumentBatch(
    task_id="valid",
    dataset_name="ds",
    data=pd.DataFrame({"text": ["content"]}),
)
print(batch.validate())  # True

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_Task_Base - Abstract base class that DocumentBatch extends
NVIDIA_NeMo_Curator_AudioBatch - Analogous task type for audio data
NVIDIA_NeMo_Curator_ImageBatch - Analogous task type for image data
NVIDIA_NeMo_Curator_AddId - Stage that adds IDs to DocumentBatch records
NVIDIA_NeMo_Curator_Modify_Module - Stage that modifies DocumentBatch fields
