Implementation:NVIDIA NeMo Curator DocumentFilter

Knowledge Sources	NVIDIA NeMo Curator
Domains	Filtering, Data Curation, Abstract Base Class
Last Updated	2026-02-14 00:00 GMT

Overview

Defines the DocumentFilter abstract base class that all text-based document filters in NeMo Curator must implement, providing the core interface for scoring and filtering documents.

Description

DocumentFilter is an ABC (Abstract Base Class) that establishes the two-phase filtering protocol used throughout the NeMo Curator filtering system:

score_document(text) - An abstract method that computes a quality or relevance score for a given document text. The return type can be a single float or a list of numeric values, depending on the filter's needs.

keep_document(scores) - An abstract method that takes the score(s) produced by score_document and returns a boolean indicating whether the document should be retained.
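
The two-phase protocol above can be sketched without the library installed. The snippet below is only an illustration: DocumentFilterSketch is a local stand-in that mirrors the two abstract methods, and MaxLengthFilter and apply_filter are hypothetical names, not part of NeMo Curator.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class DocumentFilterSketch(ABC):
    """Local stand-in mirroring the two abstract methods (not the library class)."""

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]: ...

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool: ...


class MaxLengthFilter(DocumentFilterSketch):
    """Hypothetical filter: keeps documents of at most max_chars characters."""

    def __init__(self, max_chars: int = 20):
        self._max_chars = max_chars

    def score_document(self, text: str) -> float:
        # Phase 1: reduce the text to a number
        return float(len(text))

    def keep_document(self, scores: float) -> bool:
        # Phase 2: turn the number into a keep/drop decision
        return scores <= self._max_chars


def apply_filter(doc_filter: DocumentFilterSketch, texts: list[str]) -> list[str]:
    """Hypothetical driver: score each text, then keep it only if the filter agrees."""
    kept = []
    for text in texts:
        score = doc_filter.score_document(text)
        if doc_filter.keep_document(score):
            kept.append(text)
    return kept
```

Keeping scoring separate from the decision means a score can be computed and stored without discarding anything, and a threshold can be applied later.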

Additionally, DocumentFilter exposes four properties: a name, plus three pre-computed text decompositions that can be shared across filters for efficiency:

name - A string identifier for the filter (defaults to the class name)
sentences - Cached sentence-level decomposition of the document
paragraphs - Cached paragraph-level decomposition of the document
ngrams - Cached n-gram decomposition of the document

These cached decompositions allow the ScoreFilter and Score processing stages to compute expensive text decompositions once and share them across multiple filters.
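
As a rough illustration of the sharing idea, the sketch below computes sentences once and hands them to several filters through the sentences setter. This page does not describe how ScoreFilter populates the properties, so the driver function here is an assumption. SharedDecompositionSketch is a local stand-in (assumed to default to None), and split_sentences is a deliberately naive, hypothetical splitter.

```python
class SharedDecompositionSketch:
    """Local stand-in exposing only the `sentences` property and its setter."""

    def __init__(self):
        self._sentences = None  # assumption: nothing is cached by default

    @property
    def sentences(self):
        return self._sentences

    @sentences.setter
    def sentences(self, sentences):
        self._sentences = sentences


def split_sentences(text):
    """Naive hypothetical splitter: split on periods and drop empty pieces."""
    return [s.strip() for s in text.split(".") if s.strip()]


class SentenceCount(SharedDecompositionSketch):
    def score_document(self, text):
        sents = self.sentences if self.sentences is not None else split_sentences(text)
        return len(sents)


class LongestSentenceWords(SharedDecompositionSketch):
    def score_document(self, text):
        sents = self.sentences if self.sentences is not None else split_sentences(text)
        return max((len(s.split()) for s in sents), default=0)


def score_with_shared_sentences(filters, text):
    """Split the text once, then let every filter reuse the same sentence list."""
    sentences = split_sentences(text)
    scores = {}
    for f in filters:
        f.sentences = sentences
        scores[type(f).__name__] = f.score_document(text)
        f.sentences = None  # reset so stale sentences do not leak into the next document
    return scores
```

With two filters, the splitter runs once per document instead of twice. The saving grows with the number of filters and the cost of the decomposition.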

Usage

Subclass DocumentFilter to create custom heuristic filters for text data curation. Implement score_document to define your quality metric and keep_document to define the acceptance criteria. The filter will then be compatible with NeMo Curator's ScoreFilter processing stage for batch-level document filtering.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/filters/doc_filter.py
Lines: 1-104

Signature

class DocumentFilter(ABC):
    def __init__(self): ...

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]: ...

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool: ...

    @property
    def name(self) -> str: ...

    @property
    def sentences(self) -> list: ...

    @sentences.setter
    def sentences(self, sentences: list) -> None: ...

    @property
    def paragraphs(self) -> list: ...

    @paragraphs.setter
    def paragraphs(self, paragraphs: list) -> None: ...

    @property
    def ngrams(self) -> dict: ...

    @ngrams.setter
    def ngrams(self, ngrams: dict) -> None: ...

Import

from nemo_curator.stages.text.filters.doc_filter import DocumentFilter

I/O Contract

Inputs

Name	Type	Required	Description
text	str	Yes	The text content of the document to be scored (passed to score_document)
scores	float or list[int or float]	Yes	The score(s) returned by score_document (passed to keep_document)

Outputs

Method	Return Type	Description
score_document	float or list[int or float]	A score or set of scores representing the document's quality or relevance
keep_document	bool	True if the document should be retained, False otherwise
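
The list-valued form of the contract may be easier to see with a small example. The class below is a hypothetical, duck-typed sketch that does not import the library; a real filter would subclass DocumentFilter. It returns two scores from score_document and checks both in keep_document.

```python
class LineLengthBoundsFilter:
    """Hypothetical filter scoring a document by its shortest and longest line."""

    def __init__(self, min_len: int = 2, max_len: int = 10):
        self._min_len = min_len
        self._max_len = max_len

    def score_document(self, text: str) -> list[int]:
        lengths = [len(line) for line in text.splitlines()]
        if not lengths:
            # An empty document has no lines; report zeros for both scores
            return [0, 0]
        return [min(lengths), max(lengths)]

    def keep_document(self, scores: list[int]) -> bool:
        shortest, longest = scores
        return shortest >= self._min_len and longest <= self._max_len


bounds_filter = LineLengthBoundsFilter(min_len=2, max_len=10)
scores = bounds_filter.score_document("ab\nabcd\nabcdefgh")
decision = bounds_filter.keep_document(scores)
```

Whatever shape score_document returns is the shape keep_document has to accept, so the two methods should be written as a pair.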

Usage Examples

Implementing a Custom Filter

from nemo_curator.stages.text.filters.doc_filter import DocumentFilter

class MinWordCountFilter(DocumentFilter):
    def __init__(self, min_words: int = 50):
        super().__init__()
        self._min_words = min_words
        self._name = "min_word_count"

    def score_document(self, text: str) -> float:
        return len(text.split())

    def keep_document(self, score: float) -> bool:
        return score >= self._min_words

Using a Filter Directly

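# Named word_filter to avoid shadowing Python's built-in filter()
word_filter = MinWordCountFilter(min_words=100)

document_text = "This is a sample document with some text content."
score = word_filter.score_document(document_text)
should_keep = word_filter.keep_document(score)
# The sample has 9 words, below the 100-word minimum, so should_keep is False
print(f"Score: {score}, Keep: {should_keep}")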

Using Shared Text Decompositions

class SentenceLengthFilter(DocumentFilter):
    def __init__(self, min_avg_sentence_length: float = 5.0):
        super().__init__()
        self._min_avg = min_avg_sentence_length
        self._name = "sentence_length"

    def score_document(self, text: str) -> float:
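        # Prefer pre-computed sentences when a stage has set them; otherwise
        # fall back to a naive split on periods, dropping empty fragments
        if self.sentences is not None:
            sents = self.sentences
        else:
            sents = [s for s in text.split(".") if s.strip()]
        # max(..., 1) guards against division by zero for empty documents
        avg_len = sum(len(s.split()) for s in sents) / max(len(sents), 1)
        return avg_len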

    def keep_document(self, score: float) -> bool:
        return score >= self._min_avg

Subclass Hierarchy

The following filter classes inherit from DocumentFilter:

Code Filters

PythonCommentToCodeFilter - Python comment-to-code ratio
GeneralCommentToCodeFilter - General language comment-to-code ratio
NumberOfLinesOfCodeFilter - Line count bounds
TokenizerFertilityFilter - Character-to-token ratio
XMLHeaderFilter - XML header detection
AlphaFilter - Alphabetic character ratio
HTMLBoilerplateFilter - HTML boilerplate detection
PerExtensionFilter - Per-extension threshold filtering

FastText Filters

FastTextQualityFilter - FastText-based quality scoring
FastTextLangId - FastText-based language identification

Related Pages

NVIDIA_NeMo_Curator_Code_Filters - Code-specific filter implementations
NVIDIA_NeMo_Curator_FastText_Filters - FastText-based filter implementations
NVIDIA_NeMo_Curator_ScoreFilter - Processing stage that applies DocumentFilter instances
Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
