Implementation:NVIDIA NeMo Curator DomainClassifier

From Leeroopedia
Knowledge Sources
Domains: NLP, Classification, Multilingual
Last Updated: 2026-02-14 00:00 GMT

Overview

Implements English and multilingual domain classifiers using DeBERTa-based models from HuggingFace, enabling categorization of text documents into topic domains for data curation pipelines.

Description

This module provides two domain classifier classes, both extending DistributedDataClassifier:

  • DomainClassifier - Uses the nvidia/domain-classifier model to classify English text into domains. Configured with a maximum sequence length of 512 tokens, a limit of 2000 characters per document, and the label field domain_pred.
  • MultilingualDomainClassifier - Uses the nvidia/multilingual-domain-classifier model, which supports domain classification across 52 languages. Configured with the same token and character limits as the English variant, but with the label field multilingual_domain_pred.

Both classifiers use right-side tokenizer padding (DeBERTa standard) and derive their stage names automatically from their respective model identifiers via format_name_with_suffix. They are optimized for multi-node, multi-GPU distributed inference on large datasets.
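The character limit, token limit, and right-side padding described above can be illustrated without the library. The following is a minimal sketch, not the module's implementation: toy_tokenize, encode_batch, and PAD_ID are hypothetical names, and the whitespace tokenizer stands in for the model's HuggingFace tokenizer.

```python
MAX_CHARS = 2000      # character limit applied before tokenization
MAX_SEQ_LENGTH = 512  # token limit per document
PAD_ID = 0            # hypothetical pad token id, for illustration only


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical tokenizer: one id (the word length) per whitespace-separated word."""
    return [len(word) for word in text.split()]


def encode_batch(texts: list[str]) -> list[list[int]]:
    """Truncate by characters, tokenize, truncate by tokens, then pad on the right."""
    encoded = [toy_tokenize(t[:MAX_CHARS])[:MAX_SEQ_LENGTH] for t in texts]
    width = max((len(ids) for ids in encoded), default=0)
    # Right-side padding: real tokens come first, pad ids are appended at the end.
    return [ids + [PAD_ID] * (width - len(ids)) for ids in encoded]


batch = encode_batch(["short text", "a somewhat longer piece of text"])
```

Here the shorter document is padded at the end to the width of the longer one, and a very long document is cut to at most 512 token ids.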

Usage

Use DomainClassifier to classify documents in English-only datasets into topic domains. Use MultilingualDomainClassifier for multilingual datasets that may contain text in any of the 52 supported languages. Both are suitable for filtering a dataset down to specific domains or for adding domain metadata for downstream analysis.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/classifiers/domain.py
  • Lines: 1-129

Signature

DOMAIN_MODEL_IDENTIFIER = "nvidia/domain-classifier"
MULTILINGUAL_DOMAIN_MODEL_IDENTIFIER = "nvidia/multilingual-domain-classifier"
MAX_SEQ_LENGTH = 512

class DomainClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "domain_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 2000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...

class MultilingualDomainClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "multilingual_domain_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 2000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...

Import

from nemo_curator.stages.text.classifiers.domain import DomainClassifier
from nemo_curator.stages.text.classifiers.domain import MultilingualDomainClassifier

I/O Contract

Inputs (DomainClassifier)

Name | Type | Required | Description
cache_dir | str or None | No | HuggingFace cache directory (default: None)
label_field | str | No | Name of the prediction output column (default: "domain_pred")
score_field | str or None | No | Name of the probability output column; None means scores are not retained (default: None)
text_field | str | No | Name of the text field in the input data (default: "text")
filter_by | list[str] or None | No | List of domain labels to keep; non-matching documents are removed (default: None)
max_chars | int | No | Maximum characters before tokenization (default: 2000)
sort_by_length | bool | No | Sort input by token length for performance (default: True)
model_inference_batch_size | int | No | Batch size for model inference (default: 256)
autocast | bool | No | Use autocast for faster inference (default: True)

Inputs (MultilingualDomainClassifier)

Name | Type | Required | Description
cache_dir | str or None | No | HuggingFace cache directory (default: None)
label_field | str | No | Name of the prediction output column (default: "multilingual_domain_pred")
score_field | str or None | No | Name of the probability output column; None means scores are not retained (default: None)
text_field | str | No | Name of the text field in the input data (default: "text")
filter_by | list[str] or None | No | List of domain labels to keep; non-matching documents are removed (default: None)
max_chars | int | No | Maximum characters before tokenization (default: 2000)
sort_by_length | bool | No | Sort input by token length for performance (default: True)
model_inference_batch_size | int | No | Batch size for model inference (default: 256)
autocast | bool | No | Use autocast for faster inference (default: True)
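To show why sort_by_length can help performance, the sketch below counts how many slots a padded batch layout needs with and without sorting, and restores the input order after inference. It is an illustration under stated assumptions, not the library's code: batches, padded_slots, and predict_in_length_order are hypothetical helpers, and character length stands in for token length.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def padded_slots(lengths, batch_size):
    """Total slots used when each batch is padded to its longest member."""
    return sum(max(batch) * len(batch) for batch in batches(lengths, batch_size))


def predict_in_length_order(texts, batch_size, predict):
    """Run predict on length-sorted batches, returning results in input order."""
    # Indices ordered by text length, so each batch holds similarly sized texts.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    results = [None] * len(texts)
    for idx in batches(order, batch_size):
        for i, pred in zip(idx, predict([texts[i] for i in idx])):
            results[i] = pred
    return results


lengths = [3, 100, 4, 120, 5, 110]
unsorted_cost = padded_slots(lengths, batch_size=2)
sorted_cost = padded_slots(sorted(lengths), batch_size=2)

# Stand-in "model" that returns the length of each text in the batch.
outputs = predict_in_length_order(["ccc", "a", "bb"], 2, lambda b: [len(t) for t in b])
```

With these example lengths, the unsorted layout needs 660 slots and the sorted layout 448, because short documents are no longer padded up to the length of a long neighbor.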

Outputs

Name | Type | Description
DocumentBatch | DocumentBatch | Input batch augmented with the domain prediction column
domain_pred / multilingual_domain_pred column | str (per row) | Predicted domain label for each document
score_field column | list[float] (per row) | Probability vector across all domain categories (only if score_field is specified)
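The relationship between the two output columns can be sketched as follows. This assumes the predicted label is the category with the highest probability, which is the usual convention for such classifiers but is not confirmed by this page. LABELS is a hypothetical three-label subset and label_from_scores is a hypothetical helper.

```python
# Hypothetical label subset; the real models define their own label set.
LABELS = ["Science", "Sports", "Games"]


def label_from_scores(scores: list[float]) -> str:
    """Return the label whose probability is highest (argmax over the score vector)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best]


# A row as it might look with score_field="domain_prob".
row = {"text": "Cup final recap", "domain_prob": [0.1, 0.7, 0.2]}
row["domain_pred"] = label_from_scores(row["domain_prob"])
```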

Usage Examples

English Domain Classification

from nemo_curator.stages.text.classifiers.domain import DomainClassifier

# Create a domain classifier with default settings
classifier = DomainClassifier()

Multilingual Domain Classification

from nemo_curator.stages.text.classifiers.domain import MultilingualDomainClassifier

# Classify documents across 52 languages
classifier = MultilingualDomainClassifier(
    label_field="multilingual_domain_pred",
    score_field="domain_prob",
)

Domain Filtering

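from nemo_curator.stages.text.classifiers.domain import DomainClassifier

# Keep only documents classified as Science or Computers_and_Electronics.
# filter_by values must match the model's label names exactly.
classifier = DomainClassifier(
    filter_by=["Science", "Computers_and_Electronics"],
    model_inference_batch_size=512,
)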
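The effect of filter_by on a classified batch can be sketched in plain Python. This is a minimal illustration of the documented behavior (non-matching documents are removed, and None keeps everything), not the library's implementation; apply_filter_by and the sample rows are hypothetical.

```python
def apply_filter_by(rows, label_field, filter_by):
    """Keep rows whose predicted label is in filter_by; None disables filtering."""
    if filter_by is None:
        return list(rows)
    allowed = set(filter_by)
    return [row for row in rows if row[label_field] in allowed]


rows = [
    {"text": "Quarks and gluons", "domain_pred": "Science"},
    {"text": "Cup final recap", "domain_pred": "Sports"},
    {"text": "New console review", "domain_pred": "Games"},
]
kept = apply_filter_by(rows, "domain_pred", ["Science", "Games"])
```

A label that never appears in the prediction column simply matches nothing, so a misspelled entry in filter_by silently drops documents rather than raising an error in this sketch.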
