Implementation:NVIDIA NeMo Curator DomainClassifier
| Knowledge Sources | |
|---|---|
| Domains | NLP, Classification, Multilingual |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Implements English and multilingual domain classifiers using DeBERTa-based models from HuggingFace, enabling categorization of text documents into topic domains for data curation pipelines.
Description
This module provides two domain classifier classes, both extending DistributedDataClassifier:
- DomainClassifier - Uses the nvidia/domain-classifier model for English text domain classification. Configured with a maximum sequence length of 512 tokens, a limit of 2000 characters per document, and the label field domain_pred.
- MultilingualDomainClassifier - Uses the nvidia/multilingual-domain-classifier model, which supports domain classification across 52 languages. Configured with the same token and character limits as the English variant, but with the label field multilingual_domain_pred.
Both classifiers use right-side tokenizer padding (DeBERTa standard) and derive their stage names automatically from their respective model identifiers via format_name_with_suffix. They are optimized for multi-node, multi-GPU distributed inference on large datasets.
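The interaction of the character limit, the token limit, and right-side padding can be illustrated with a minimal sketch. It does not use NeMo Curator or the DeBERTa tokenizer: the whitespace tokenizer, the "[PAD]" token, and the helper names truncate_chars, toy_tokenize, and pad_right are hypothetical stand-ins, and the assumption that max_chars is applied as a plain string slice before tokenization is an illustration, not confirmed by the source.

```python
MAX_CHARS = 2000       # character limit quoted above
MAX_SEQ_LENGTH = 512   # token limit quoted above
PAD = "[PAD]"          # hypothetical pad token, for illustration only


def truncate_chars(text: str, max_chars: int = MAX_CHARS) -> str:
    """Cut raw text to at most max_chars characters (assumed: a plain slice)."""
    return text[:max_chars]


def toy_tokenize(text: str, max_seq_length: int = MAX_SEQ_LENGTH) -> list[str]:
    """Whitespace split standing in for the real subword tokenizer, capped at max_seq_length."""
    return text.split()[:max_seq_length]


def pad_right(batch: list[list[str]], pad: str = PAD) -> list[list[str]]:
    """Right-side padding: append pad tokens until every row matches the longest row."""
    width = max(len(tokens) for tokens in batch)
    return [tokens + [pad] * (width - len(tokens)) for tokens in batch]


docs = ["short text", "a somewhat longer piece of text"]
batch = pad_right([toy_tokenize(truncate_chars(d)) for d in docs])
```

In this sketch the shorter document is padded on the right to the length of the longer one, so every row in the batch has the same width.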
Usage
Use DomainClassifier to classify documents in English-only datasets into topic domains. Use MultilingualDomainClassifier for multilingual datasets that may contain text in any of the 52 supported languages. Both are suitable for filtering datasets down to specific domains or for adding domain metadata for downstream analysis.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/classifiers/domain.py
- Lines: 1-129
Signature
DOMAIN_MODEL_IDENTIFIER = "nvidia/domain-classifier"
MULTILINGUAL_DOMAIN_MODEL_IDENTIFIER = "nvidia/multilingual-domain-classifier"
MAX_SEQ_LENGTH = 512
class DomainClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "domain_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 2000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...
class MultilingualDomainClassifier(DistributedDataClassifier):
    def __init__(
        self,
        cache_dir: str | None = None,
        label_field: str = "multilingual_domain_pred",
        score_field: str | None = None,
        text_field: str = "text",
        filter_by: list[str] | None = None,
        max_chars: int = 2000,
        sort_by_length: bool = True,
        model_inference_batch_size: int = 256,
        autocast: bool = True,
    ): ...
Import
from nemo_curator.stages.text.classifiers.domain import DomainClassifier
from nemo_curator.stages.text.classifiers.domain import MultilingualDomainClassifier
I/O Contract
Inputs (DomainClassifier)
| Name | Type | Required | Description |
|---|---|---|---|
| cache_dir | str or None | No | HuggingFace cache directory (default: None) |
| label_field | str | No | Name of the prediction output column (default: "domain_pred") |
| score_field | str or None | No | Name of the probability output column; None means scores are not retained |
| text_field | str | No | Name of the text field in the input data (default: "text") |
| filter_by | list[str] or None | No | List of domain labels to keep; non-matching documents are removed |
| max_chars | int | No | Maximum characters before tokenization (default: 2000) |
| sort_by_length | bool | No | Sort input by token length for performance (default: True) |
| model_inference_batch_size | int | No | Batch size for model inference (default: 256) |
| autocast | bool | No | Use autocast for faster inference (default: True) |
Inputs (MultilingualDomainClassifier)
| Name | Type | Required | Description |
|---|---|---|---|
| cache_dir | str or None | No | HuggingFace cache directory (default: None) |
| label_field | str | No | Name of the prediction output column (default: "multilingual_domain_pred") |
| score_field | str or None | No | Name of the probability output column; None means scores are not retained |
| text_field | str | No | Name of the text field in the input data (default: "text") |
| filter_by | list[str] or None | No | List of domain labels to keep; non-matching documents are removed |
| max_chars | int | No | Maximum characters before tokenization (default: 2000) |
| sort_by_length | bool | No | Sort input by token length for performance (default: True) |
| model_inference_batch_size | int | No | Batch size for model inference (default: 256) |
| autocast | bool | No | Use autocast for faster inference (default: True) |
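Why sort_by_length can help performance is easiest to see by counting padded token slots. The sketch below is an illustration only: it assumes each batch is padded to its longest member, and the helper padded_slots and the sample lengths are hypothetical. It says nothing about how NeMo Curator performs the sort internally.

```python
def padded_slots(lengths: list[int], batch_size: int) -> int:
    """Total token slots processed when each batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token lengths for eight documents, short and long interleaved.
lengths = [5, 500, 8, 480, 6, 510, 7, 490]

unsorted_cost = padded_slots(lengths, batch_size=4)
sorted_cost = padded_slots(sorted(lengths), batch_size=4)
```

With these sample lengths, sorting groups the short documents into one batch and the long ones into another, so far fewer slots are spent on padding.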
Outputs
| Name | Type | Description |
|---|---|---|
| DocumentBatch | DocumentBatch | Input batch augmented with domain prediction column |
| domain_pred / multilingual_domain_pred column | str (per row) | Predicted domain label for each document |
| score_field column | list[float] (per row) | Probability vector across all domain categories (only if score_field is specified) |
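The relationship between the two output columns can be sketched as follows. This assumes the predicted label is the category with the highest probability in the score vector, which is the usual convention for classifiers but is not stated in the source. The label names and the helper predict_label are placeholders, not the model's actual label set.

```python
def predict_label(probs: list[float], labels: list[str]) -> str:
    """Return the label whose probability is highest (assumed argmax convention)."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


# Placeholder label set; the real model defines its own categories.
labels = ["Science", "Sports", "Finance"]
row_scores = [0.15, 0.70, 0.15]
row_label = predict_label(row_scores, labels)
```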
Usage Examples
English Domain Classification
from nemo_curator.stages.text.classifiers.domain import DomainClassifier
# Create a domain classifier with default settings
classifier = DomainClassifier()
Multilingual Domain Classification
from nemo_curator.stages.text.classifiers.domain import MultilingualDomainClassifier
# Classify documents across 52 languages
classifier = MultilingualDomainClassifier(
    label_field="multilingual_domain_pred",
    score_field="domain_prob",
)
Domain Filtering
# Only keep documents classified as Science or Technology domains
classifier = DomainClassifier(
    filter_by=["Science", "Technology"],
    model_inference_batch_size=512,
)
Related Pages
- NVIDIA_NeMo_Curator_BaseClassifierStage - Parent class providing the DistributedDataClassifier infrastructure
- NVIDIA_NeMo_Curator_ContentTypeClassifier - Sibling classifier for content type classification
- NVIDIA_NeMo_Curator_QualityClassifier - Sibling classifier for text quality assessment
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base