Implementation:NVIDIA NeMo Curator DocumentBatch
| Knowledge Sources | |
|---|---|
| Domains | Data Curation, Text Processing, Pipeline Tasks |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The DocumentBatch class defines the core task type for processing batches of text documents in the NeMo Curator pipeline, storing tabular document data as either a PyArrow Table or Pandas DataFrame.
Description
DocumentBatch extends Task[pa.Table | pd.DataFrame] and serves as the primary data container for text-based data curation workflows. It converts between its PyArrow and Pandas representations through the to_pyarrow() and to_pandas() methods.
Data storage: The data field holds either a pa.Table or a pd.DataFrame. When no value is supplied, the default is produced by a dataclass default factory set to pa.Table.
Format conversion: The to_pyarrow() method converts the data to a PyArrow Table (passthrough if already PyArrow, conversion via pa.Table.from_pandas() if Pandas). The to_pandas() method converts to a Pandas DataFrame (passthrough if already Pandas, conversion via .to_pandas() if PyArrow). Both methods raise TypeError for unsupported data types.
Column access: The get_columns() method returns column names from either data format, using .columns for DataFrames or .column_names for PyArrow Tables.
Validation: The validate() method checks that the batch has at least one row and at least one column, logging warnings via loguru on failure.
Item counting: The num_items property returns the number of rows (documents) using len(self.data).
Usage
DocumentBatch is the standard task type consumed and produced by virtually all text processing stages in NeMo Curator, including filtering, classification, deduplication, modification, splitting, and joining stages. It is the fundamental data unit flowing through text curation pipelines.
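To illustrate the consume-and-produce pattern, the sketch below shows the kind of per-batch transformation a filtering stage might apply. It deliberately uses plain Pandas and does not show NeMo Curator's stage API, which this page does not describe; drop_short_documents and its min_chars parameter are hypothetical names introduced for illustration.

```python
import pandas as pd


def drop_short_documents(df: pd.DataFrame, min_chars: int = 10) -> pd.DataFrame:
    """Hypothetical filter: keep rows whose 'text' has at least min_chars characters."""
    keep = df["text"].str.len() >= min_chars
    # Reset the index so the result looks like a freshly built batch.
    return df[keep].reset_index(drop=True)


docs = pd.DataFrame({"text": ["Hello world", "Hi"], "language": ["en", "en"]})
filtered = drop_short_documents(docs)

print(filtered["text"].tolist())  # ['Hello world']
```

In a pipeline, a stage would presumably obtain the DataFrame with batch.to_pandas(), apply a function like this, and wrap the result in a new DocumentBatch with its own task_id and dataset_name.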
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/tasks/document.py
- Lines: 1-80
Signature
@dataclass
class DocumentBatch(Task[pa.Table | pd.DataFrame]):
    data: pa.Table | pd.DataFrame = field(default_factory=pa.Table)

    def to_pyarrow(self) -> pa.Table: ...
    def to_pandas(self) -> pd.DataFrame: ...

    @property
    def num_items(self) -> int: ...

    def get_columns(self) -> list[str]: ...
    def validate(self) -> bool: ...
Import
from nemo_curator.tasks.document import DocumentBatch
# or
from nemo_curator.tasks import DocumentBatch
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | pa.Table or pd.DataFrame | No | Tabular document data (default: empty PyArrow Table) |
| task_id | str | Yes | Unique identifier for this task (inherited from Task) |
| dataset_name | str | Yes | Name of the dataset this task belongs to (inherited from Task) |
Outputs
| Name | Type | Description |
|---|---|---|
| to_pyarrow() | pa.Table | Document data as a PyArrow Table |
| to_pandas() | pd.DataFrame | Document data as a Pandas DataFrame |
| get_columns() | list[str] | Column names from the underlying data |
| num_items | int | Number of documents (rows) in the batch |
| validate() | bool | Whether the batch has non-empty data with at least one column |
Usage Examples
Creating a DocumentBatch from Pandas
import pandas as pd
from nemo_curator.tasks import DocumentBatch

# Two documents, one per row
df = pd.DataFrame({
    "text": ["Hello world", "Foo bar baz"],
    "language": ["en", "en"],
})

batch = DocumentBatch(
    task_id="batch_001",
    dataset_name="my_dataset",
    data=df,
)

print(batch.num_items)      # 2
print(batch.get_columns())  # ['text', 'language']
Format Conversion
# Continues from the previous example: `batch` wraps a Pandas DataFrame.
arrow_table = batch.to_pyarrow()  # converted via pa.Table.from_pandas()
pandas_df = batch.to_pandas()     # passthrough, since the data is already Pandas
Validation
from nemo_curator.tasks import DocumentBatch
import pandas as pd

# Valid batch: one row and one column
batch = DocumentBatch(
    task_id="valid",
    dataset_name="ds",
    data=pd.DataFrame({"text": ["content"]}),
)

print(batch.validate())  # True
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_Task_Base - Abstract base class that DocumentBatch extends
- NVIDIA_NeMo_Curator_AudioBatch - Analogous task type for audio data
- NVIDIA_NeMo_Curator_ImageBatch - Analogous task type for image data
- NVIDIA_NeMo_Curator_AddId - Stage that adds IDs to DocumentBatch records
- NVIDIA_NeMo_Curator_Modify_Module - Stage that modifies DocumentBatch fields