Implementation:NVIDIA NeMo Curator DocumentBatch

From Leeroopedia
Knowledge Sources
Domains: Data Curation, Text Processing, Pipeline Tasks
Last Updated: 2026-02-14 00:00 GMT

Overview

The DocumentBatch class defines the core task type for processing batches of text documents in the NeMo Curator pipeline, storing tabular document data as either a PyArrow Table or Pandas DataFrame.

Description

DocumentBatch extends Task[pa.Table | pd.DataFrame] and serves as the primary data container for text-based data curation workflows. Its to_pyarrow() and to_pandas() methods convert between the PyArrow and Pandas backends, so the data can be read in either format regardless of which one the batch was created with.

Data storage: The data field defaults to a PyArrow Table factory and can hold either a pa.Table or pd.DataFrame.

Format conversion: The to_pyarrow() method converts the data to a PyArrow Table (passthrough if already PyArrow, conversion via pa.Table.from_pandas() if Pandas). The to_pandas() method converts to a Pandas DataFrame (passthrough if already Pandas, conversion via .to_pandas() if PyArrow). Both methods raise TypeError for unsupported data types.

Column access: The get_columns() method returns column names from either data format, using .columns for DataFrames or .column_names for PyArrow Tables.

Validation: The validate() method checks that the batch has at least one row and at least one column, logging warnings via loguru on failure.
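
The two checks can be sketched as follows. This is an illustration only: the function name is_valid_batch, the warning messages, and the order of the checks are assumptions, and the standard-library logging module stands in for loguru so the snippet has no extra dependency.

```python
import logging

import pandas as pd

# Stand-in for the loguru logger used by the real class.
logger = logging.getLogger("document_batch_sketch")


def is_valid_batch(data: pd.DataFrame) -> bool:
    """Hypothetical helper: require at least one row and one column."""
    if len(data) == 0:
        logger.warning("Batch has no rows")
        return False
    if len(data.columns) == 0:
        logger.warning("Batch has no columns")
        return False
    return True


valid = is_valid_batch(pd.DataFrame({"text": ["content"]}))
no_rows = is_valid_batch(pd.DataFrame({"text": []}))
no_columns = is_valid_batch(pd.DataFrame(index=[0, 1]))
```

In this sketch a failed check only logs a warning and returns False; it does not raise, which matches the description of validate() returning a bool.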

Item counting: The num_items property returns the number of rows (documents) using len(self.data).

Usage

DocumentBatch is the standard task type consumed and produced by virtually all text processing stages in NeMo Curator, including filtering, classification, deduplication, modification, splitting, and joining stages. It is the fundamental data unit flowing through text curation pipelines.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/tasks/document.py
  • Lines: 1-80

Signature

@dataclass
class DocumentBatch(Task[pa.Table | pd.DataFrame]):
    data: pa.Table | pd.DataFrame = field(default_factory=pa.Table)

    def to_pyarrow(self) -> pa.Table: ...
    def to_pandas(self) -> pd.DataFrame: ...

    @property
    def num_items(self) -> int: ...
    def get_columns(self) -> list[str]: ...
    def validate(self) -> bool: ...

Import

from nemo_curator.tasks.document import DocumentBatch
# or
from nemo_curator.tasks import DocumentBatch

I/O Contract

Inputs

Name          Type                      Required  Description
data          pa.Table or pd.DataFrame  No        Tabular document data (default: empty PyArrow Table)
task_id       str                       Yes       Unique identifier for this task (inherited from Task)
dataset_name  str                       Yes       Name of the dataset this task belongs to (inherited from Task)

Outputs

Name           Type          Description
to_pyarrow()   pa.Table      Document data as a PyArrow Table
to_pandas()    pd.DataFrame  Document data as a Pandas DataFrame
get_columns()  list[str]     Column names from the underlying data
num_items      int           Number of documents (rows) in the batch
validate()     bool          Whether the batch has at least one row and at least one column

Usage Examples

Creating a DocumentBatch from Pandas

import pandas as pd
from nemo_curator.tasks import DocumentBatch

df = pd.DataFrame({
    "text": ["Hello world", "Foo bar baz"],
    "language": ["en", "en"],
})

batch = DocumentBatch(
    task_id="batch_001",
    dataset_name="my_dataset",
    data=df,
)

print(batch.num_items)       # 2
print(batch.get_columns())   # ['text', 'language']

Format Conversion

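from nemo_curator.tasks import DocumentBatch

# Reuses `batch` from the previous example, which holds a Pandas DataFrame.
arrow_table = batch.to_pyarrow()  # converted via pa.Table.from_pandas()
pandas_df = batch.to_pandas()     # passthrough: data is already a DataFrame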

Validation

from nemo_curator.tasks import DocumentBatch
import pandas as pd

# Valid batch
batch = DocumentBatch(
    task_id="valid",
    dataset_name="ds",
    data=pd.DataFrame({"text": ["content"]}),
)
print(batch.validate())  # True
