Implementation:NVIDIA NeMo Curator ContentTypeClassifier

Knowledge Sources	NVIDIA NeMo Curator
Domains	NLP, Classification, Content Analysis
Last Updated	2026-02-14 00:00 GMT

Overview

Implements a text classifier that categorizes documents into one of 11 distinct content/speech types using the nvidia/content-type-classifier-deberta HuggingFace model, optimized for multi-node, multi-GPU distributed inference.

Description

ContentTypeClassifier extends DistributedDataClassifier to provide a ready-to-use content type classification stage. It uses the NemoCurator Content Type Classifier DeBERTa model (nvidia/content-type-classifier-deberta) to analyze textual content and classify documents into one of 11 speech types based on their content characteristics.

The classifier is preconfigured with content-type-specific defaults:

Model: nvidia/content-type-classifier-deberta
Max sequence length: 1024 tokens
Max characters: 5000 (text is truncated beyond this limit before tokenization)
Label field: content_pred
Padding side: Right (DeBERTa tokenizer standard)

The stage name is automatically derived from the model identifier via the format_name_with_suffix utility.
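
The character limit in the defaults above can be pictured as a plain slice applied to each document before tokenization. The following is a minimal sketch of that idea, not the library's own code; the helper name truncate_text is hypothetical.

```python
MAX_CHARS = 5000  # default max_chars from the defaults listed above


def truncate_text(text: str, max_chars: int = MAX_CHARS) -> str:
    """Hypothetical helper: keep at most max_chars characters of text."""
    return text[:max_chars]


short_doc = "A short document."
long_doc = "x" * 12_000

# Short documents pass through unchanged; long ones are cut at the limit.
truncated_short = truncate_text(short_doc)
truncated_long = truncate_text(long_doc)
print(len(truncated_short), len(truncated_long))
```

Note that the limit counts characters, not tokens, so the 1024-token maximum sequence length still applies to whatever text survives the slice.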

Usage

Use ContentTypeClassifier when you need to categorize documents by their content or speech type (e.g., narrative, instructional, or conversational) as part of a data curation pipeline. It is particularly useful for filtering a dataset down to specific content types or for attaching content type metadata for downstream analysis.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/classifiers/content_type.py
Lines: 1-78

Signature

CONTENT_TYPE_MODEL_IDENTIFIER = "nvidia/content-type-classifier-deberta"
MAX_SEQ_LENGTH = 1024

class ContentTypeClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "content_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 5000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...

Import

from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier

I/O Contract

Inputs

Name	Type	Required	Description
cache_dir	str or None	No	HuggingFace cache directory for model files (default: None)
label_field	str	No	Name of the prediction output column (default: "content_pred")
score_field	str or None	No	Name of the probability output column; None means scores are not retained (default: None)
text_field	str	No	Name of the text field in the input data (default: "text")
filter_by	list[str] or None	No	List of content type labels to keep; if set, non-matching documents are removed (default: None, no filtering)
max_chars	int	No	Maximum characters to feed to the tokenizer (default: 5000)
sort_by_length	bool	No	Sort inputs by token length to improve inference throughput (default: True)
model_inference_batch_size	int	No	Batch size for model inference (default: 256)
autocast	bool	No	Use autocast for faster inference at minor accuracy cost (default: True)
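
To illustrate why sort_by_length can help, the sketch below groups texts of similar length into the same batch so that less padding is needed, then writes results back by index so the original row order is preserved. It is an illustration in plain Python under stated assumptions: it sorts by character length rather than token length, and batch_by_length is a hypothetical helper, not part of NeMo Curator.

```python
def batch_by_length(texts: list[str], batch_size: int) -> list[list[int]]:
    """Hypothetical helper: return batches of row indices, shortest texts first."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [order[i : i + batch_size] for i in range(0, len(order), batch_size)]


texts = ["a" * 900, "b" * 10, "c" * 950, "d" * 12]
batches = batch_by_length(texts, batch_size=2)

# Rough padding cost of a batch: (longest text in batch) * (rows in batch).
padded_sorted = sum(max(len(texts[i]) for i in b) * len(b) for b in batches)
padded_unsorted = sum(
    max(len(t) for t in texts[i : i + 2]) * len(texts[i : i + 2])
    for i in range(0, len(texts), 2)
)

# Per-batch results are written back by index, so the output keeps
# the original row order even though batches were built in sorted order.
results = [None] * len(texts)
for batch in batches:
    for i in batch:
        results[i] = len(texts[i])  # stand-in for a model prediction
```

In this toy input the sorted batching pads to far fewer total characters than batching in arrival order, which is the effect the parameter is meant to exploit.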

Outputs

Name	Type	Description
DocumentBatch	DocumentBatch	Input batch augmented with content type prediction column
label_field column (default: content_pred)	str (per row)	Predicted content type label (one of 11 categories)
score_field column	list[float] (per row)	Probability vector across all 11 content types (only if score_field is specified)
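
The relationship between the label column and the score column can be sketched as an argmax over the probability vector. The example below is illustrative only: the names in LABELS are placeholders (the real 11 labels come from the model's configuration), and it assumes the predicted label is the highest-probability class.

```python
# Placeholder label names; the real 11 labels are defined by the model config.
LABELS = [f"type_{i}" for i in range(11)]


def label_from_scores(scores: list[float], labels: list[str] = LABELS) -> str:
    """Return the label whose probability is highest (assumed argmax rule)."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]


# A made-up probability vector with its peak at index 3.
scores = [0.02] * 11
scores[3] = 0.80
predicted = label_from_scores(scores)
print(predicted)
```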

Usage Examples

Basic Usage

from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier

# Create a content type classifier with default settings
classifier = ContentTypeClassifier()

With Score Output

# Classify and also output probability scores
classifier = ContentTypeClassifier(
    label_field="content_pred",
    score_field="content_prob",
    max_chars=5000,
)

Filtering by Content Type

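# Keep only documents classified as narrative or instructional.
# The strings in filter_by must match the model's label names exactly;
# check the model card for the exact spelling of each label.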
classifier = ContentTypeClassifier(
    filter_by=["Narrative", "Instructional"],
    model_inference_batch_size=512,
)
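
As a rough picture of what filter_by does to a batch, the sketch below keeps only rows whose predicted label is in the allowed list. It operates on a plain list of dicts rather than a DocumentBatch; keep_matching is a hypothetical helper, and the rows and label values are made up for illustration.

```python
from typing import Optional


def keep_matching(
    rows: list[dict],
    filter_by: Optional[list[str]],
    label_field: str = "content_pred",
) -> list[dict]:
    """Hypothetical helper: drop rows whose label is not in filter_by."""
    if filter_by is None:
        return rows  # no filtering requested
    allowed = set(filter_by)
    return [row for row in rows if row[label_field] in allowed]


rows = [
    {"text": "Once upon a time...", "content_pred": "Narrative"},
    {"text": "Step 1: unplug the device.", "content_pred": "Instructional"},
    {"text": "lol same", "content_pred": "Conversational"},
]
kept = keep_matching(rows, filter_by=["Narrative", "Instructional"])
print([row["content_pred"] for row in kept])
```

Because matching is on the label string itself, a misspelled or differently cased entry in filter_by would silently match nothing in a sketch like this one.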

Related Pages

NVIDIA_NeMo_Curator_BaseClassifierStage - Parent class providing the DistributedDataClassifier infrastructure
NVIDIA_NeMo_Curator_QualityClassifier - Sibling classifier for text quality assessment
NVIDIA_NeMo_Curator_DomainClassifier - Sibling classifier for domain classification
Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
