Implementation:NVIDIA NeMo Curator FastText Filters
| Knowledge Sources | |
|---|---|
| Domains | NLP, Language Identification, Data Quality, Filtering |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Provides FastText-based document filters for quality scoring and language identification, enabling lightweight CPU-based filtering in large-scale text curation pipelines.
Description
This module defines two DocumentFilter subclasses that use FastText models for document-level classification:
- FastTextQualityFilter - Loads a FastText quality classification model from a local file path and scores documents based on their predicted quality. The model predicts a quality label and a confidence score. If the predicted label does not match the target label (default: __label__hq), the score is inverted (1 - score) so that higher scores always indicate higher quality. The keep_document method uses a Pareto-distributed randomized threshold: np.random.pareto(alpha) > 1 - score, where alpha (default: 3) controls the aggressiveness of the threshold. This stochastic approach means that higher-quality documents are more likely to be kept, but even lower-quality documents have a chance of being retained.
- FastTextLangId - Loads a FastText language identification model and extracts the top-1 language code and confidence score. The language code is extracted from the FastText label format (e.g., __label__en becomes EN). The keep_document method applies a simple confidence threshold (default: 0.3), keeping documents whose language identification confidence is at least the cutoff.
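The score inversion described for FastTextQualityFilter can be sketched in a few lines. This is an illustration under stated assumptions, not the library source: normalize_quality_score is a hypothetical helper, and "__label__lq" is an invented name for a non-target label.

```python
def normalize_quality_score(
    predicted_label: str,
    confidence: float,
    target_label: str = "__label__hq",
) -> float:
    """Map a (label, confidence) prediction to a score where higher means better.

    If the model predicted the target label, its confidence is used directly.
    Otherwise the confidence is inverted, so a confident non-target prediction
    becomes a low quality score.
    """
    if predicted_label == target_label:
        return confidence
    return 1.0 - confidence


# A confident high-quality prediction keeps its confidence as the score.
hq_score = normalize_quality_score("__label__hq", 0.9)

# "__label__lq" is a hypothetical non-target label used only for illustration.
lq_score = normalize_quality_score("__label__lq", 0.9)
```

Under this sketch, both predictions end up on the same scale, which is what lets a single rule in keep_document treat every score uniformly.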
Both filters support lazy model loading via model_check_or_download() (validates model file existence) and load_model() (loads the FastText model into memory). These methods are called during stage setup rather than at instantiation time.
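The two-step loading pattern might look roughly like the sketch below. It is a minimal illustration, not the NeMo Curator implementation: LazyModelFilter and its loader argument are hypothetical, and raising FileNotFoundError for a missing file is an assumption. With the real library, the loader would presumably be fasttext.load_model.

```python
import os
from typing import Any, Callable, Optional


class LazyModelFilter:
    """Hypothetical stand-in showing deferred model loading."""

    def __init__(self, model_path: Optional[str], loader: Callable[[str], Any]):
        if model_path is None:
            raise ValueError("model_path must point to a local model file")
        self._model_path = model_path
        self._loader = loader
        self._model = None  # nothing is loaded at instantiation time

    def model_check_or_download(self) -> None:
        # Only validates that the file exists; no model is read yet.
        if not os.path.exists(self._model_path):
            raise FileNotFoundError(f"Model file not found: {self._model_path}")

    def load_model(self) -> None:
        # Called during stage setup, so each worker loads its own copy.
        self._model = self._loader(self._model_path)
```

Deferring the load keeps the filter object cheap to construct and to ship between processes, since the in-memory model only exists after setup has run on the worker.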
Usage
Use FastTextQualityFilter as an early-stage quality filter in a text curation pipeline. It is lightweight and CPU-based, making it suitable for pre-filtering before more expensive GPU-based classifiers. Use FastTextLangId to identify and filter documents by language, keeping only documents in the target language(s) above a confidence threshold.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/filters/fasttext_filter.py
- Lines: 1-92
Signature
class FastTextQualityFilter(DocumentFilter):
    def __init__(
        self,
        model_path: str | None = None,
        label: str = "__label__hq",
        alpha: float = 3,
        seed: int = 42,
    ): ...
    def model_check_or_download(self) -> None: ...
    def load_model(self) -> None: ...
    def score_document(self, text: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class FastTextLangId(DocumentFilter):
    def __init__(
        self,
        model_path: str | None = None,
        min_langid_score: float = 0.3,
    ): ...
    def model_check_or_download(self) -> None: ...
    def load_model(self) -> None: ...
    def score_document(self, text: str) -> list[float | str]: ...
    def keep_document(self, score: float | str) -> bool: ...
Import
from nemo_curator.stages.text.filters.fasttext_filter import FastTextQualityFilter
from nemo_curator.stages.text.filters.fasttext_filter import FastTextLangId
I/O Contract
Inputs (FastTextQualityFilter)
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Path to a local FastText quality model file (raises ValueError if None) |
| label | str | No | Target quality label in FastText format (default: "__label__hq") |
| alpha | float | No | Pareto distribution shape parameter for stochastic thresholding (default: 3) |
| seed | int | No | Random seed for reproducibility (default: 42) |
Inputs (FastTextLangId)
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Path to a local FastText language ID model file (raises ValueError if None) |
| min_langid_score | float | No | Minimum confidence threshold for language identification (default: 0.3) |
Outputs
| Filter | Method | Return Type | Description |
|---|---|---|---|
| FastTextQualityFilter | score_document | float | Quality score between 0 and 1, where higher means better quality |
| FastTextQualityFilter | keep_document | bool | True if a Pareto-distributed random draw exceeds 1 - score |
| FastTextLangId | score_document | str (repr of list) | String representation of [confidence_score, language_code] |
| FastTextLangId | keep_document | bool | True if confidence score >= min_langid_score cutoff |
Usage Examples
Quality Filtering
from nemo_curator.stages.text.filters.fasttext_filter import FastTextQualityFilter
# Create quality filter with a FastText model
quality_filter = FastTextQualityFilter(
    model_path="/models/fasttext_quality.bin",
    label="__label__hq",
    alpha=3,
    seed=42,
)
# Lazy model loading (typically called by the stage setup)
quality_filter.model_check_or_download()
quality_filter.load_model()
# Score and filter a document
score = quality_filter.score_document("This is a well-written document about science.")
keep = quality_filter.keep_document(score)
Language Identification
from nemo_curator.stages.text.filters.fasttext_filter import FastTextLangId
# Create language ID filter
lang_filter = FastTextLangId(
    model_path="/models/lid.176.bin",
    min_langid_score=0.3,
)
# Lazy model loading
lang_filter.model_check_or_download()
lang_filter.load_model()
# Score and filter a document
score = lang_filter.score_document("This is an English document.")
keep = lang_filter.keep_document(score)
Filter Details
FastTextQualityFilter Stochastic Thresholding
The quality filter uses a Pareto distribution for its keep/discard decision rather than a fixed threshold. The decision rule is:
np.random.pareto(self._alpha) > 1 - score
This means:
- Documents with score close to 1.0 (high quality) are almost always kept, since 1 - score is near 0
- Documents with score close to 0.0 (low quality) are rarely kept, since 1 - score is near 1
- The alpha parameter controls the distribution shape: higher alpha makes the threshold more aggressive (fewer low-quality documents pass)
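The decision rule also has a closed form that makes these points concrete. NumPy's np.random.pareto draws from a Lomax (Pareto II) distribution, whose survival function is P(X > x) = (1 + x)^(-alpha), so the keep probability for a given score should be (2 - score)^(-alpha). The sketch below is illustrative and not part of NeMo Curator: keep_probability and simulate_keep_rate are hypothetical helpers, and the simulation uses the standard library's random.paretovariate(alpha) - 1 as a stand-in for np.random.pareto(alpha).

```python
import random


def keep_probability(score: float, alpha: float = 3.0) -> float:
    """Analytic P(X > 1 - score) for X ~ Lomax(alpha), i.e. (2 - score) ** -alpha."""
    return (2.0 - score) ** -alpha


def simulate_keep_rate(
    score: float,
    alpha: float = 3.0,
    trials: int = 50_000,
    seed: int = 42,
) -> float:
    """Empirical keep rate for the rule `pareto_draw > 1 - score`."""
    rng = random.Random(seed)
    # paretovariate() samples a classical Pareto with minimum 1;
    # subtracting 1 shifts it to the Lomax form that starts at 0.
    kept = sum((rng.paretovariate(alpha) - 1.0) > (1.0 - score) for _ in range(trials))
    return kept / trials


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"score={s:.2f}  analytic={keep_probability(s):.3f}  "
              f"simulated={simulate_keep_rate(s):.3f}")
```

Under these assumptions, with the default alpha of 3 a document scored 0.0 is kept with probability 2^-3 = 0.125, and a document scored 1.0 is kept with probability 1. Raising alpha lowers the keep probability for every score below 1.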
FastTextLangId Score Format
The score_document method returns a string representation of a Python list containing the confidence score and the two-character uppercase language code:
# Example return value
"[0.95, 'EN']"
The keep_document method parses this string back into a list and checks if the confidence score meets the minimum threshold. The string format is used to allow backend conversions between different data processing frameworks.
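The round trip between the list and its string form can be sketched as follows. This is a hedged illustration rather than the library's code: all four helpers are hypothetical, ast.literal_eval is one safe way to parse the string (the actual parsing call is not stated here), and the target_langs argument is an extension added for illustration, since the keep_document described above only checks confidence.

```python
import ast
from typing import Iterable, Optional, Tuple


def label_to_lang_code(label: str) -> str:
    # Assumption: the code is the last two characters, e.g. "__label__en" -> "EN".
    return label[-2:].upper()


def encode_score(confidence: float, lang_code: str) -> str:
    # Serialize as the string form of a Python list.
    return str([confidence, lang_code])


def decode_score(score: str) -> Tuple[float, str]:
    # literal_eval only accepts Python literals, so it cannot run arbitrary code.
    confidence, lang_code = ast.literal_eval(score)
    return float(confidence), str(lang_code)


def keep(
    score: str,
    min_langid_score: float = 0.3,
    target_langs: Optional[Iterable[str]] = None,
) -> bool:
    confidence, lang_code = decode_score(score)
    # Hypothetical extension: optionally restrict to a set of language codes.
    if target_langs is not None and lang_code not in set(target_langs):
        return False
    return confidence >= min_langid_score


encoded = encode_score(0.95, label_to_lang_code("__label__en"))
```

Keeping the score as a single string means it fits in one column regardless of the backend, at the cost of a parse step whenever the confidence or language code is needed on its own.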
Related Pages
- NVIDIA_NeMo_Curator_DocumentFilter - Abstract base class that both filters implement
- NVIDIA_NeMo_Curator_Code_Filters - Sibling module with code-specific filters
- NVIDIA_NeMo_Curator_ScoreFilter - Processing stage that applies DocumentFilter instances
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base