Implementation:NVIDIA NeMo Curator DocumentFilter
| Knowledge Sources | |
|---|---|
| Domains | Filtering, Data Curation, Abstract Base Class |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Defines the DocumentFilter abstract base class that all text-based document filters in NeMo Curator must implement, providing the core interface for scoring and filtering documents.
Description
DocumentFilter is an ABC (Abstract Base Class) that establishes the two-phase filtering protocol used throughout the NeMo Curator filtering system:
- score_document(text) - An abstract method that computes a quality or relevance score for a given document text. The return type can be a single float or a list of numeric values, depending on the filter's needs.
- keep_document(scores) - An abstract method that takes the score(s) produced by score_document and returns a boolean indicating whether the document should be retained.
Additionally, DocumentFilter provides property-based access to pre-computed text decompositions that can be shared across filters for efficiency:
- name - A string identifier for the filter (defaults to the class name)
- sentences - Cached sentence-level decomposition of the document
- paragraphs - Cached paragraph-level decomposition of the document
- ngrams - Cached n-gram decomposition of the document
Because these decompositions are cached on the filter, the ScoreFilter and Score processing stages can compute an expensive decomposition once and share it across multiple filters.
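The sharing pattern can be sketched as follows. This is an illustration under stated assumptions: `CachedSentenceFilter` is a hypothetical stand-in that exposes only a `sentences` property and setter, the cache is assumed to start as `None`, and `split_sentences` is a naive splitter invented for the example. How the real processing stages populate the cache is not shown here.

```python
class CachedSentenceFilter:
    """Hypothetical stand-in exposing only a cached `sentences` property."""

    def __init__(self):
        self._sentences = None  # assumption: the cache starts empty

    @property
    def sentences(self):
        return self._sentences

    @sentences.setter
    def sentences(self, sentences):
        self._sentences = sentences


class SentenceCountFilter(CachedSentenceFilter):
    """Invented example: requires a minimum number of sentences."""

    def __init__(self, min_sentences: int = 2):
        super().__init__()
        self._min_sentences = min_sentences

    def score_document(self, text: str) -> float:
        return float(len(self.sentences))

    def keep_document(self, scores: float) -> bool:
        return scores >= self._min_sentences


class LongestSentenceFilter(CachedSentenceFilter):
    """Invented example: rejects documents with an overly long sentence."""

    def __init__(self, max_words: int = 10):
        super().__init__()
        self._max_words = max_words

    def score_document(self, text: str) -> float:
        return float(max(len(s.split()) for s in self.sentences))

    def keep_document(self, scores: float) -> bool:
        return scores <= self._max_words


def split_sentences(text: str) -> list[str]:
    """Naive period-based splitter, standing in for an expensive decomposition."""
    return [s.strip() for s in text.split(".") if s.strip()]


text = "Filters share work. Sentences are split once. Both filters reuse them."
filters = [SentenceCountFilter(min_sentences=2), LongestSentenceFilter(max_words=10)]

shared = split_sentences(text)  # the decomposition is computed once
for f in filters:
    f.sentences = shared        # every filter receives the same list object

decisions = [f.keep_document(f.score_document(text)) for f in filters]
```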
Usage
Subclass DocumentFilter to create custom heuristic filters for text data curation. Implement score_document to define your quality metric and keep_document to define the acceptance criteria. The filter will then be compatible with NeMo Curator's ScoreFilter processing stage for batch-level document filtering.
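To show what applying a filter over a batch amounts to, here is a minimal sketch in plain Python. It is not the ScoreFilter implementation: `score_and_filter`, `SupportsScoreAndKeep`, and `UppercaseRatioFilter` are hypothetical names introduced for illustration, and the example filter is duck-typed rather than subclassing DocumentFilter so that the snippet runs without the library.

```python
from typing import Protocol


class SupportsScoreAndKeep(Protocol):
    """Structural type for anything with the two DocumentFilter methods."""

    def score_document(self, text: str) -> float: ...

    def keep_document(self, scores: float) -> bool: ...


class UppercaseRatioFilter:
    """Invented example: drops documents that are mostly uppercase letters."""

    def __init__(self, max_ratio: float = 0.5):
        self._max_ratio = max_ratio

    def score_document(self, text: str) -> float:
        letters = [ch for ch in text if ch.isalpha()]
        if not letters:
            return 0.0
        return sum(ch.isupper() for ch in letters) / len(letters)

    def keep_document(self, scores: float) -> bool:
        return scores <= self._max_ratio


def score_and_filter(
    docs: list[str], doc_filter: SupportsScoreAndKeep
) -> list[tuple[str, float]]:
    """Score every document, then keep the (text, score) pairs that pass."""
    kept = []
    for text in docs:
        score = doc_filter.score_document(text)
        if doc_filter.keep_document(score):
            kept.append((text, score))
    return kept


docs = ["SHOUTING TEXT HERE", "Normal sentence here."]
kept = score_and_filter(docs, UppercaseRatioFilter(max_ratio=0.5))
```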
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/filters/doc_filter.py
- Lines: 1-104
Signature
class DocumentFilter(ABC):
    def __init__(self): ...
    @abstractmethod
    def score_document(self, text: str) -> float | list[int | float]: ...
    @abstractmethod
    def keep_document(self, scores: float | list[int | float]) -> bool: ...
    @property
    def name(self) -> str: ...
    @property
    def sentences(self) -> list: ...
    @sentences.setter
    def sentences(self, sentences: list) -> None: ...
    @property
    def paragraphs(self) -> list: ...
    @paragraphs.setter
    def paragraphs(self, paragraphs: list) -> None: ...
    @property
    def ngrams(self) -> dict: ...
    @ngrams.setter
    def ngrams(self, ngrams: dict) -> None: ...
Import
from nemo_curator.stages.text.filters.doc_filter import DocumentFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | The text content of the document to be scored (passed to score_document) |
| scores | float or list[int or float] | Yes | The score(s) returned by score_document (passed to keep_document) |
Outputs
| Method | Return Type | Description |
|---|---|---|
| score_document | float or list[int or float] | A score or set of scores representing the document's quality or relevance |
| keep_document | bool | True if the document should be retained, False otherwise |
Usage Examples
Implementing a Custom Filter
from nemo_curator.stages.text.filters.doc_filter import DocumentFilter

class MinWordCountFilter(DocumentFilter):
    def __init__(self, min_words: int = 50):
        super().__init__()
        self._min_words = min_words
        self._name = "min_word_count"

    def score_document(self, text: str) -> float:
        # Score is the number of whitespace-separated words, as a float
        # to match the declared return type.
        return float(len(text.split()))

    def keep_document(self, score: float) -> bool:
        # Keep documents with at least min_words words.
        return score >= self._min_words
Using a Filter Directly
# Named word_filter to avoid shadowing the built-in filter().
word_filter = MinWordCountFilter(min_words=100)
document_text = "This is a sample document with some text content."
score = word_filter.score_document(document_text)
should_keep = word_filter.keep_document(score)
print(f"Score: {score}, Keep: {should_keep}")
Using Cached Sentences
class SentenceLengthFilter(DocumentFilter):
    def __init__(self, min_avg_sentence_length: float = 5.0):
        super().__init__()
        self._min_avg = min_avg_sentence_length
        self._name = "sentence_length"

    def score_document(self, text: str) -> float:
        # Use the pre-computed sentences if available; otherwise fall back
        # to a naive split on periods, dropping empty fragments.
        if self.sentences is not None:
            sents = self.sentences
        else:
            sents = [s for s in text.split(".") if s.strip()]
        # Average words per sentence; max(..., 1) avoids division by zero.
        return sum(len(s.split()) for s in sents) / max(len(sents), 1)

    def keep_document(self, score: float) -> bool:
        return score >= self._min_avg
Subclass Hierarchy
The following filter classes inherit from DocumentFilter:
Code Filters
- PythonCommentToCodeFilter - Python comment-to-code ratio
- GeneralCommentToCodeFilter - General language comment-to-code ratio
- NumberOfLinesOfCodeFilter - Line count bounds
- TokenizerFertilityFilter - Character-to-token ratio
- XMLHeaderFilter - XML header detection
- AlphaFilter - Alphabetic character ratio
- HTMLBoilerplateFilter - HTML boilerplate detection
- PerExtensionFilter - Per-extension threshold filtering
FastText Filters
- FastTextQualityFilter - FastText-based quality scoring
- FastTextLangId - FastText-based language identification
Related Pages
- NVIDIA_NeMo_Curator_Code_Filters - Code-specific filter implementations
- NVIDIA_NeMo_Curator_FastText_Filters - FastText-based filter implementations
- NVIDIA_NeMo_Curator_ScoreFilter - Processing stage that applies DocumentFilter instances
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base