Implementation:NVIDIA NeMo Curator DocumentFilter

From Leeroopedia
Knowledge Sources
Domains Filtering, Data Curation, Abstract Base Class
Last Updated 2026-02-14 00:00 GMT

Overview

Defines the DocumentFilter abstract base class that all text-based document filters in NeMo Curator must implement, providing the core interface for scoring and filtering documents.

Description

DocumentFilter is an ABC (Abstract Base Class) that establishes the two-phase filtering protocol used throughout the NeMo Curator filtering system:

  1. score_document(text) - An abstract method that computes a quality or relevance score for a given document text. The return type can be a single float or a list of numeric values, depending on the filter's needs.
  2. keep_document(scores) - An abstract method that takes the score(s) produced by score_document and returns a boolean indicating whether the document should be retained.
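
Because both methods are abstract, Python refuses to instantiate a subclass that implements only one of them. The sketch below illustrates this with a hypothetical stand-in class, DocumentFilterSketch, that mirrors only the two abstract methods from the signature on this page. It is not the NeMo Curator class, and ScoreOnlyFilter and try_instantiate are names invented for the illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Union

Scores = Union[float, List[float]]


class DocumentFilterSketch(ABC):
    """Hypothetical stand-in for DocumentFilter (not the library class)."""

    @abstractmethod
    def score_document(self, text: str) -> Scores: ...

    @abstractmethod
    def keep_document(self, scores: Scores) -> bool: ...


class ScoreOnlyFilter(DocumentFilterSketch):
    """Implements only the scoring phase, so the class remains abstract."""

    def score_document(self, text: str) -> float:
        return float(len(text))


def try_instantiate(cls: type) -> str:
    """Return "ok" if cls can be instantiated, else the TypeError message."""
    try:
        cls()
    except TypeError as exc:
        return str(exc)
    return "ok"


# The message names the missing abstract method, keep_document.
message = try_instantiate(ScoreOnlyFilter)
print(message)
```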

Additionally, DocumentFilter provides property-based access to pre-computed text decompositions that can be shared across filters for efficiency:

  • name - A string identifier for the filter (defaults to the class name)
  • sentences - Cached sentence-level decomposition of the document
  • paragraphs - Cached paragraph-level decomposition of the document
  • ngrams - Cached n-gram decomposition of the document

These cached decompositions allow the ScoreFilter and Score processing stages to compute expensive text decompositions once and share them across multiple filters.
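
As a rough illustration of the sharing pattern described above, the following sketch computes a sentence split once and assigns it to two filters through a sentences setter. Everything here is an assumption made for the example: CachingFilterSketch is a stand-in that exposes only the sentences property, the unset value is assumed to be None, and split_sentences is a naive hypothetical helper. How the real ScoreFilter and Score stages populate the caches is not shown on this page.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class CachingFilterSketch(ABC):
    """Hypothetical stand-in exposing only a `sentences` property and setter."""

    def __init__(self) -> None:
        self._sentences: Optional[List[str]] = None  # assumption: unset is None

    @property
    def sentences(self) -> Optional[List[str]]:
        return self._sentences

    @sentences.setter
    def sentences(self, sentences: List[str]) -> None:
        self._sentences = sentences

    @abstractmethod
    def score_document(self, text: str) -> float: ...

    @abstractmethod
    def keep_document(self, scores: float) -> bool: ...


class SentenceCountFilter(CachingFilterSketch):
    def score_document(self, text: str) -> float:
        return float(len(self.sentences or []))

    def keep_document(self, scores: float) -> bool:
        return scores >= 2


class MaxSentenceWordsFilter(CachingFilterSketch):
    def score_document(self, text: str) -> float:
        return float(max((len(s.split()) for s in self.sentences or []), default=0))

    def keep_document(self, scores: float) -> bool:
        return scores <= 10


split_calls = 0


def split_sentences(text: str) -> List[str]:
    """Naive period-based splitter that counts how often it runs."""
    global split_calls
    split_calls += 1
    return [s.strip() for s in text.split(".") if s.strip()]


text = "Short one. A slightly longer second sentence. Third."
filters = [SentenceCountFilter(), MaxSentenceWordsFilter()]

shared = split_sentences(text)  # the expensive step runs once
for f in filters:
    f.sentences = shared        # every filter reuses the same list

scores = [f.score_document(text) for f in filters]
print(scores, split_calls)
```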

Usage

Subclass DocumentFilter to create custom heuristic filters for text data curation. Implement score_document to define your quality metric and keep_document to define the acceptance criteria. The filter will then be compatible with NeMo Curator's ScoreFilter processing stage for batch-level document filtering.
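
To make the batch-level idea concrete, the sketch below applies the score-then-keep protocol to a list of texts in plain Python. It does not use or reproduce ScoreFilter itself: score_and_filter is a hypothetical helper, and MeanWordLengthFilter is a duck-typed stand-in (a real filter would subclass DocumentFilter) with illustrative thresholds.

```python
from typing import List, Tuple


class MeanWordLengthFilter:
    """Duck-typed stand-in; a real filter would subclass DocumentFilter."""

    def __init__(self, min_len: float = 3.0, max_len: float = 10.0) -> None:
        self._min_len = min_len
        self._max_len = max_len

    def score_document(self, text: str) -> float:
        words = text.split()
        # Mean characters per whitespace-separated word (0.0 for empty text)
        return sum(len(w) for w in words) / max(len(words), 1)

    def keep_document(self, scores: float) -> bool:
        return self._min_len <= scores <= self._max_len


def score_and_filter(doc_filter, texts: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical helper: score each text, keep those the filter accepts."""
    kept = []
    for text in texts:
        score = doc_filter.score_document(text)
        if doc_filter.keep_document(score):
            kept.append((text, score))
    return kept


batch = [
    "a b c d",                              # mean word length 1.0, too short
    "curated text corpora need filtering",  # mean word length 6.2, kept
    "x" * 40,                               # one 40-character token, too long
]
kept = score_and_filter(MeanWordLengthFilter(), batch)
print(kept)
```

Keeping the score alongside each retained text mirrors the common practice of recording filter scores for later inspection, though whether and how a given stage stores them is configuration-dependent.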

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/filters/doc_filter.py
  • Lines: 1-104

Signature

class DocumentFilter(ABC):
    def __init__(self): ...

    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]: ...

    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool: ...

    @property
    def name(self) -> str: ...

    @property
    def sentences(self) -> list: ...

    @sentences.setter
    def sentences(self, sentences: list) -> None: ...

    @property
    def paragraphs(self) -> list: ...

    @paragraphs.setter
    def paragraphs(self, paragraphs: list) -> None: ...

    @property
    def ngrams(self) -> dict: ...

    @ngrams.setter
    def ngrams(self, ngrams: dict) -> None: ...

Import

from nemo_curator.stages.text.filters.doc_filter import DocumentFilter

I/O Contract

Inputs

Name   | Type                       | Required | Description
text   | str                        | Yes      | The text content of the document to be scored (passed to score_document)
scores | float or list[int or float] | Yes      | The score(s) returned by score_document (passed to keep_document)

Outputs

Method         | Return Type                 | Description
score_document | float or list[int or float] | A score or set of scores representing the document's quality or relevance
keep_document  | bool                        | True if the document should be retained, False otherwise
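
The contract allows score_document to return a list rather than a single float, in which case keep_document receives that same list. A minimal sketch of a list-valued filter follows. ShortLineFilter and its thresholds are hypothetical, and the class is duck-typed for self-containment rather than derived from DocumentFilter.

```python
from typing import List


class ShortLineFilter:
    """Hypothetical list-valued filter: one score per non-empty line."""

    def __init__(self, min_words: int = 3, max_short_fraction: float = 0.5) -> None:
        self._min_words = min_words
        self._max_short_fraction = max_short_fraction

    def score_document(self, text: str) -> List[int]:
        # Word count of every non-empty line
        return [len(line.split()) for line in text.splitlines() if line.strip()]

    def keep_document(self, scores: List[int]) -> bool:
        if not scores:
            return False  # nothing to judge, so drop the document
        short = sum(1 for n in scores if n < self._min_words)
        return short / len(scores) <= self._max_short_fraction


line_filter = ShortLineFilter()

mostly_long = "one two three four\nhi\nfive six seven eight nine"
mostly_short = "a\nb\nc d e"

long_scores = line_filter.score_document(mostly_long)
short_scores = line_filter.score_document(mostly_short)
print(long_scores, line_filter.keep_document(long_scores))
print(short_scores, line_filter.keep_document(short_scores))
```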

Usage Examples

Implementing a Custom Filter

from nemo_curator.stages.text.filters.doc_filter import DocumentFilter

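class MinWordCountFilter(DocumentFilter):
    """Keeps documents that contain at least min_words whitespace-separated words."""

    def __init__(self, min_words: int = 50):
        super().__init__()
        self._min_words = min_words
        self._name = "min_word_count"

    def score_document(self, text: str) -> float:
        # Cast to float so the value matches the annotated return type
        return float(len(text.split()))

    def keep_document(self, score: float) -> bool:
        # Retain the document only if it meets the minimum word count
        return score >= self._min_words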

Using a Filter Directly

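# Named word_filter to avoid shadowing Python's built-in filter()
word_filter = MinWordCountFilter(min_words=100)

document_text = "This is a sample document with some text content."
score = word_filter.score_document(document_text)
should_keep = word_filter.keep_document(score)
# The sample has 9 words, below the 100-word minimum, so Keep is False
print(f"Score: {score}, Keep: {should_keep}")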

Using Shared Text Decompositions

class SentenceLengthFilter(DocumentFilter):
    def __init__(self, min_avg_sentence_length: float = 5.0):
        super().__init__()
        self._min_avg = min_avg_sentence_length
        self._name = "sentence_length"

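    def score_document(self, text: str) -> float:
        # Prefer the pre-computed sentences when a stage has set them
        if self.sentences is not None:
            sents = self.sentences
        else:
            # Fallback: naive split on periods, dropping empty fragments
            sents = [s for s in text.split(".") if s.strip()]
        # Average number of whitespace-separated words per sentence
        avg_len = sum(len(s.split()) for s in sents) / max(len(sents), 1)
        return avg_len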

    def keep_document(self, score: float) -> bool:
        return score >= self._min_avg

Subclass Hierarchy

The following filter classes inherit from DocumentFilter:

Code Filters

  • PythonCommentToCodeFilter - Python comment-to-code ratio
  • GeneralCommentToCodeFilter - General language comment-to-code ratio
  • NumberOfLinesOfCodeFilter - Line count bounds
  • TokenizerFertilityFilter - Character-to-token ratio
  • XMLHeaderFilter - XML header detection
  • AlphaFilter - Alphabetic character ratio
  • HTMLBoilerplateFilter - HTML boilerplate detection
  • PerExtensionFilter - Per-extension threshold filtering

FastText Filters

  • FastTextQualityFilter - FastText-based quality scoring
  • FastTextLangId - FastText-based language identification
