Implementation:NVIDIA NeMo Curator ContentTypeClassifier
| Knowledge Sources | |
|---|---|
| Domains | NLP, Classification, Content Analysis |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Implements a text classifier that categorizes documents into one of 11 distinct content/speech types using the nvidia/content-type-classifier-deberta HuggingFace model, optimized for multi-node, multi-GPU distributed inference.
Description
ContentTypeClassifier extends DistributedDataClassifier to provide a ready-to-use content type classification stage. It uses the NemoCurator Content Type Classifier DeBERTa model (nvidia/content-type-classifier-deberta) to analyze textual content and classify documents into one of 11 speech types based on their content characteristics.
The classifier is preconfigured with content-type-specific defaults:
- Model: nvidia/content-type-classifier-deberta
- Max sequence length: 1024 tokens
- Max characters: 5000 (text is truncated beyond this limit before tokenization)
- Label field: content_pred
- Padding side: Right (DeBERTa tokenizer standard)
The stage name is automatically derived from the model identifier via the format_name_with_suffix utility.
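The max-characters default can be pictured as a simple slice applied to each document before it reaches the tokenizer. The sketch below illustrates that idea only and is not the library's code; `truncate_for_tokenizer` is a hypothetical helper name, and keeping the leading characters is an assumption about how the truncation is applied.

```python
def truncate_for_tokenizer(text: str, max_chars: int = 5000) -> str:
    """Hypothetical helper: keep at most max_chars leading characters."""
    return text[:max_chars]


# A 6000-character document is cut down to the 5000-character default.
doc = "a" * 6000
truncated = truncate_for_tokenizer(doc)
print(len(truncated))

# Shorter documents pass through unchanged.
short = truncate_for_tokenizer("short text")
print(short)
```

Character truncation is a cheap pre-filter: the tokenizer still enforces the 1024-token maximum sequence length, but it never has to process more than max_chars characters per document.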
Usage
Use ContentTypeClassifier when you need to categorize documents by their content or speech type (e.g., narrative, instructional, or conversational) as part of a data curation pipeline. It is particularly useful for filtering a dataset down to specific content types or for attaching content type metadata for downstream analysis.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/classifiers/content_type.py
- Lines: 1-78
Signature
CONTENT_TYPE_MODEL_IDENTIFIER = "nvidia/content-type-classifier-deberta"
MAX_SEQ_LENGTH = 1024
class ContentTypeClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "content_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 5000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...
Import
from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cache_dir | str or None | No | HuggingFace cache directory for model files (default: None) |
| label_field | str | No | Name of the prediction output column (default: "content_pred") |
| score_field | str or None | No | Name of the probability output column; None means scores are not retained (default: None) |
| text_field | str | No | Name of the text field in the input data (default: "text") |
| filter_by | list[str] or None | No | List of content type labels to keep; if set, non-matching documents are removed (default: None) |
| max_chars | int | No | Maximum characters to feed to the tokenizer (default: 5000) |
| sort_by_length | bool | No | Sort input by token length for inference performance (default: True) |
| model_inference_batch_size | int | No | Batch size for model inference (default: 256) |
| autocast | bool | No | Use autocast for faster inference at minor accuracy cost (default: True) |
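To illustrate why sort_by_length can help, the sketch below orders texts by length so that each batch holds similarly sized inputs and needs less padding. It is a plain-Python illustration, not the library's implementation: it uses character length as a stand-in for token length, and both helper names are hypothetical.

```python
def sort_by_length_with_order(texts):
    """Hypothetical helper: return texts sorted shortest-first plus the index order used."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    sorted_texts = [texts[i] for i in order]
    return sorted_texts, order


def restore_order(values, order):
    """Hypothetical helper: map per-item results back to the original input order."""
    restored = [None] * len(values)
    for position, original_index in enumerate(order):
        restored[original_index] = values[position]
    return restored


texts = ["a much longer document body", "hi", "medium text"]
sorted_texts, order = sort_by_length_with_order(texts)

# Stand-in for model predictions: one value per sorted text.
fake_predictions = [len(t) for t in sorted_texts]

# Results line up with the original inputs again after restoring.
restored = restore_order(fake_predictions, order)
print(sorted_texts)
print(restored)
```

Grouping inputs of similar length means short documents are not padded out to the length of the longest document in the dataset, which is the usual motivation for this kind of option.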
Outputs
| Name | Type | Description |
|---|---|---|
| DocumentBatch | DocumentBatch | Input batch augmented with content type prediction column |
| label_field column (default: content_pred) | str (per row) | Predicted content type label (one of 11 categories) |
| score_field column | list[float] (per row) | Probability vector across all 11 content types (only if score_field is specified) |
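The relationship between the two output columns can be sketched as follows, assuming the predicted label is the entry with the highest probability in the score vector. The label names below are placeholders, not the model's actual 11 labels, and `label_from_scores` is a hypothetical helper.

```python
# Placeholder label names; the real model defines its own 11 labels.
LABELS = [f"type_{i}" for i in range(11)]


def label_from_scores(scores, labels=LABELS):
    """Hypothetical helper: pick the label with the highest probability."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index]


# An 11-element probability vector with most of the mass on index 3.
scores = [0.01] * 11
scores[3] = 0.90
predicted = label_from_scores(scores)
print(predicted)
```

Retaining the score column is useful when a downstream step needs a confidence threshold rather than only the top label.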
Usage Examples
Basic Usage
from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier
# Create a content type classifier with default settings
classifier = ContentTypeClassifier()
With Score Output
# Classify and also output probability scores
classifier = ContentTypeClassifier(
    label_field="content_pred",
    score_field="content_prob",
    max_chars=5000,
)
Filtering by Content Type
# Only keep documents classified as narrative or instructional
# (filter_by strings must match the model's label names exactly)
classifier = ContentTypeClassifier(
    filter_by=["Narrative", "Instructional"],
    model_inference_batch_size=512,
)
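The effect of filter_by on a batch can be pictured with plain Python rows. This is a minimal sketch of the documented behavior (non-matching documents are removed), not the library's code; `keep_matching` is a hypothetical helper, and the label strings simply mirror the example above.

```python
def keep_matching(rows, filter_by, label_field="content_pred"):
    """Hypothetical helper: keep rows whose predicted label is in filter_by."""
    if filter_by is None:
        # No filter configured: every row is kept.
        return list(rows)
    allowed = set(filter_by)
    return [row for row in rows if row[label_field] in allowed]


rows = [
    {"text": "Once upon a time...", "content_pred": "Narrative"},
    {"text": "Step 1: open the lid.", "content_pred": "Instructional"},
    {"text": "lol same", "content_pred": "Conversational"},
]

kept = keep_matching(rows, ["Narrative", "Instructional"])
print([row["content_pred"] for row in kept])
```

Because matching is by exact string, a misspelled or differently cased label would silently drop every document of that type, so it is worth checking the label names against the model's documentation before running a large job.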
Related Pages
- NVIDIA_NeMo_Curator_BaseClassifierStage - Parent class providing the DistributedDataClassifier infrastructure
- NVIDIA_NeMo_Curator_QualityClassifier - Sibling classifier for text quality assessment
- NVIDIA_NeMo_Curator_DomainClassifier - Sibling classifier for domain classification
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base