
Implementation:NVIDIA NeMo Curator ContentTypeClassifier

From Leeroopedia
Knowledge Sources
Domains: NLP, Classification, Content Analysis
Last Updated: 2026-02-14 00:00 GMT

Overview

Implements a text classifier that assigns each document to one of 11 content (speech) types using the nvidia/content-type-classifier-deberta Hugging Face model. It is designed for multi-node, multi-GPU distributed inference.

Description

ContentTypeClassifier extends DistributedDataClassifier to provide a ready-to-use content type classification stage. It runs the NemoCurator Content Type Classifier DeBERTa model (nvidia/content-type-classifier-deberta) over each document's text and assigns one of 11 speech types.

The classifier is preconfigured with content-type-specific defaults:

  • Model: nvidia/content-type-classifier-deberta
  • Max sequence length: 1024 tokens
  • Max characters: 5000 (text is truncated beyond this limit before tokenization)
  • Label field: content_pred
  • Padding side: Right (DeBERTa tokenizer standard)

The stage name is automatically derived from the model identifier via the format_name_with_suffix utility.
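The character cap listed above is applied before tokenization, so a very long document is cut to its first 5000 characters and only that prefix reaches the tokenizer. The following is a minimal sketch of that pre-truncation step in plain Python, not the library's own code; truncate_for_tokenizer is a hypothetical helper name.

```python
MAX_CHARS = 5000  # default character cap from the list above


def truncate_for_tokenizer(text: str, max_chars: int = MAX_CHARS) -> str:
    """Keep only the first `max_chars` characters of `text`."""
    return text[:max_chars]


docs = ["short document", "x" * 12_000]
truncated = [truncate_for_tokenizer(d) for d in docs]

# Short texts pass through unchanged; long ones are cut at the cap.
print([len(t) for t in truncated])
```

Lowering max_chars reduces tokenization work on long documents, at the cost of classifying on a shorter prefix.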

Usage

Use ContentTypeClassifier when you need to categorize documents by content or speech type (for example narrative, instructional, or conversational) as part of a data curation pipeline. It is particularly useful for filtering a dataset down to specific content types, or for adding content type metadata for downstream analysis.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/classifiers/content_type.py
  • Lines: 1-78

Signature

CONTENT_TYPE_MODEL_IDENTIFIER = "nvidia/content-type-classifier-deberta"
MAX_SEQ_LENGTH = 1024

class ContentTypeClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "content_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 5000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...

Import

from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier

I/O Contract

Inputs

Name | Type | Required | Description
cache_dir | str or None | No | Hugging Face cache directory for model files (default: None)
label_field | str | No | Name of the prediction output column (default: "content_pred")
score_field | str or None | No | Name of the probability output column; None means scores are not retained (default: None)
text_field | str | No | Name of the text field in the input data (default: "text")
filter_by | list[str] or None | No | List of content type labels to keep; if set, non-matching documents are removed (default: None)
max_chars | int | No | Maximum number of characters passed to the tokenizer (default: 5000)
sort_by_length | bool | No | Sort input by token length for inference performance (default: True)
model_inference_batch_size | int | No | Batch size for model inference (default: 256)
autocast | bool | No | Use autocast for faster inference at a minor accuracy cost (default: True)
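Sorting by length helps because a batch is padded to its longest sequence, so grouping texts of similar length wastes less compute on padding. The sketch below illustrates the general idea in plain Python and is not the library's implementation: a whitespace word count stands in for the real token count, and the helper names and the step that restores the input order are assumptions made for the illustration.

```python
def batches_sorted_by_length(texts, batch_size):
    """Yield (original_indices, batch_texts) with similar-length texts grouped."""
    # Whitespace word count is a stand-in for the real token count (assumption).
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [texts[i] for i in idx]


def classify_in_original_order(texts, predict_batch, batch_size=2):
    """Run `predict_batch` on length-sorted batches; return results in input order."""
    results = [None] * len(texts)
    for idx, batch in batches_sorted_by_length(texts, batch_size):
        for i, pred in zip(idx, predict_batch(batch)):
            results[i] = pred
    return results


texts = ["a b c d e f", "a", "a b c", "a b"]

# Stand-in "model": returns the word count of each text in the batch.
word_counts = classify_in_original_order(
    texts, lambda batch: [len(t.split()) for t in batch]
)
print(word_counts)
```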

Outputs

Name | Type | Description
DocumentBatch | DocumentBatch | Input batch augmented with the content type prediction column
label_field column (default: content_pred) | str (per row) | Predicted content type label (one of 11 categories)
score_field column | list[float] (per row) | Probability vector across all 11 content types (present only if score_field is specified)
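To show how the two output columns relate, the sketch below picks a label from an 11-element probability vector. It assumes the predicted label is the class with the highest probability, which the page does not state explicitly. The label names are placeholders, since the real names come from the model's configuration, and label_from_probs is a hypothetical helper.

```python
# Placeholder label names; the real ones come from the model's configuration.
LABELS = [f"type_{i}" for i in range(11)]


def label_from_probs(probs, labels=LABELS):
    """Return the label whose probability is highest (argmax, assumed)."""
    if len(probs) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


# One row's probability vector: ten small values and one dominant class.
probs = [0.01] * 11
probs[4] = 0.90
predicted = label_from_probs(probs)
print(predicted)
```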

Usage Examples

Basic Usage

from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier

# Create a content type classifier with default settings
classifier = ContentTypeClassifier()

With Score Output

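from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier

# Classify and also keep the per-class probability scores
classifier = ContentTypeClassifier(
    label_field="content_pred",   # prediction column (the default name)
    score_field="content_prob",   # probability column; not retained when None
    max_chars=5000,               # the default, shown explicitly
)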

Filtering by Content Type

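from nemo_curator.stages.text.classifiers.content_type import ContentTypeClassifier

# Only keep documents classified as narrative or instructional.
# The label strings should match the model's label names exactly.
classifier = ContentTypeClassifier(
    filter_by=["Narrative", "Instructional"],
    model_inference_batch_size=512,
)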
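The effect of filter_by on the output rows can be pictured with a small plain-Python sketch. It is an illustration of the described behavior (non-matching documents are removed), not the library's code; apply_filter_by is a hypothetical helper, and the label strings simply reuse those from the example above.

```python
def apply_filter_by(rows, filter_by, label_field="content_pred"):
    """Keep rows whose predicted label is in `filter_by`; keep all rows if it is None."""
    if filter_by is None:
        return list(rows)
    keep = set(filter_by)
    return [row for row in rows if row[label_field] in keep]


# Rows as they might look after classification (labels are illustrative).
rows = [
    {"text": "Once upon a time...", "content_pred": "Narrative"},
    {"text": "Step 1: open the lid.", "content_pred": "Instructional"},
    {"text": "lol same", "content_pred": "Conversational"},
]

kept = apply_filter_by(rows, ["Narrative", "Instructional"])
print([row["content_pred"] for row in kept])
```

A label that is misspelled or differs in case matches nothing, so an empty result is worth checking against the label names first.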
