Implementation:NVIDIA NeMo Curator FastText Filters

From Leeroopedia
Knowledge Sources
Domains: NLP, Language Identification, Data Quality, Filtering
Last Updated: 2026-02-14 00:00 GMT

Overview

Provides FastText-based document filters for quality scoring and language identification, enabling lightweight CPU-based filtering in large-scale text curation pipelines.

Description

This module defines two DocumentFilter subclasses that use FastText models for document-level classification:

  • FastTextQualityFilter - Loads a FastText quality classification model from a local file path and scores each document by its predicted quality. The model predicts a quality label and a confidence score. If the predicted label does not match the target label (default: __label__hq), the score is inverted (1 - score) so that higher scores always indicate higher quality. The keep_document method compares a Pareto-distributed random draw against the score: the document is kept when np.random.pareto(alpha) > 1 - score, where alpha (default: 3) controls how aggressive the filter is. Because the decision is stochastic, higher-quality documents are more likely to be kept, but lower-quality documents still have a chance of being retained.
  • FastTextLangId - Loads a FastText language identification model and extracts the top-1 language code and confidence score. The language code is extracted from the FastText label format (e.g., __label__en becomes EN). The keep_document method applies a fixed confidence threshold (default: 0.3), keeping documents whose language identification confidence is at least the cutoff.
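The two scoring conventions described above can be sketched in a few lines. This is a minimal illustration rather than the module's code: the helper names normalize_quality_score and parse_lang_code are hypothetical, and stripping the __label__ prefix is an assumed way of producing the uppercase code.

```python
FASTTEXT_PREFIX = "__label__"  # label prefix used by FastText models


def normalize_quality_score(
    pred_label: str,
    confidence: float,
    target_label: str = "__label__hq",
) -> float:
    """Return a score where higher always means higher quality.

    Hypothetical helper: if the predicted label is not the target label,
    the confidence is inverted (1 - confidence).
    """
    if pred_label == target_label:
        return confidence
    return 1.0 - confidence


def parse_lang_code(pred_label: str) -> str:
    """Turn a FastText label such as '__label__en' into 'EN' (assumed approach)."""
    if pred_label.startswith(FASTTEXT_PREFIX):
        pred_label = pred_label[len(FASTTEXT_PREFIX):]
    return pred_label.upper()


# A confident high-quality prediction keeps its confidence as the score.
print(normalize_quality_score("__label__hq", 0.9))
# A confident prediction of any other label maps to a low score.
print(normalize_quality_score("__label__lq", 0.9))
print(parse_lang_code("__label__en"))
```

The inversion is only meaningful for a two-label model, where confidence in the non-target label is the complement of confidence in the target label.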

Both filters support lazy model loading via model_check_or_download() (validates model file existence) and load_model() (loads the FastText model into memory). These methods are called during stage setup rather than at instantiation time.
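The lazy loading pattern can be illustrated with a small stand-in class. This is a sketch under assumptions: LazyModelFilter and its injected loader are hypothetical, the FileNotFoundError is an assumed choice of exception, and raising ValueError in the constructor is one reading of the "raises ValueError if None" note. In a real filter the loader would be the FastText model loader.

```python
from __future__ import annotations

import os
from typing import Any, Callable


class LazyModelFilter:
    """Hypothetical stand-in showing construction, validation, and loading as separate steps."""

    def __init__(self, model_path: str | None, loader: Callable[[str], Any]):
        if model_path is None:
            raise ValueError("model_path must be provided")
        self._model_path = model_path
        self._loader = loader
        self._model = None  # nothing heavy is loaded at construction time

    def model_check_or_download(self) -> None:
        # Validate that the model file exists before any worker tries to load it.
        if not os.path.exists(self._model_path):
            raise FileNotFoundError(f"Model file not found: {self._model_path}")

    def load_model(self) -> None:
        # Load the model into memory; intended to run during stage setup.
        self._model = self._loader(self._model_path)
```

Keeping the constructor cheap means the filter object can be created and passed around before the model is loaded, and the model itself is loaded only where the filter actually runs.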

Usage

Use FastTextQualityFilter as an early-stage quality filter in a text curation pipeline. It is lightweight and CPU-based, making it suitable for pre-filtering before more expensive GPU-based classifiers. Use FastTextLangId to identify and filter documents by language, keeping only documents in the target language(s) above a confidence threshold.
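Since the documented keep_document check for FastTextLangId is on confidence alone, restricting output to specific languages takes an extra comparison on the language code. The sketch below assumes already-parsed (confidence, language_code) pairs; select_target_languages and the sample data are hypothetical and not part of NeMo Curator.

```python
def select_target_languages(scored_docs, targets, min_langid_score=0.3):
    """Keep texts whose language is in `targets` and whose confidence meets the cutoff.

    scored_docs: iterable of (text, (confidence, lang_code)) pairs.
    targets: set of uppercase language codes, e.g. {"EN", "DE"}.
    """
    kept = []
    for text, (confidence, lang_code) in scored_docs:
        if lang_code in targets and confidence >= min_langid_score:
            kept.append(text)
    return kept


# Made-up scores for illustration only.
docs = [
    ("This is an English document.", (0.95, "EN")),
    ("Ceci est un document.", (0.91, "FR")),
    ("asdf qwer zxcv", (0.12, "EN")),
]
print(select_target_languages(docs, targets={"EN"}))
```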

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/filters/fasttext_filter.py
  • Lines: 1-92

Signature

class FastTextQualityFilter(DocumentFilter):
    def __init__(
        self,
        model_path: str | None = None,
        label: str = "__label__hq",
        alpha: float = 3,
        seed: int = 42,
    ): ...
    def model_check_or_download(self) -> None: ...
    def load_model(self) -> None: ...
    def score_document(self, text: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class FastTextLangId(DocumentFilter):
    def __init__(
        self,
        model_path: str | None = None,
        min_langid_score: float = 0.3,
    ): ...
    def model_check_or_download(self) -> None: ...
    def load_model(self) -> None: ...
    def score_document(self, text: str) -> list[float | str]: ...
    def keep_document(self, score: float | str) -> bool: ...

Import

from nemo_curator.stages.text.filters.fasttext_filter import FastTextQualityFilter
from nemo_curator.stages.text.filters.fasttext_filter import FastTextLangId

I/O Contract

Inputs (FastTextQualityFilter)

Name | Type | Required | Description
model_path | str | Yes | Path to a local FastText quality model file (raises ValueError if None)
label | str | No | Target quality label in FastText format (default: "__label__hq")
alpha | float | No | Pareto distribution shape parameter for stochastic thresholding (default: 3)
seed | int | No | Random seed for reproducibility (default: 42)

Inputs (FastTextLangId)

Name | Type | Required | Description
model_path | str | Yes | Path to a local FastText language ID model file (raises ValueError if None)
min_langid_score | float | No | Minimum confidence threshold for language identification (default: 0.3)

Outputs

Filter | Method | Return Type | Description
FastTextQualityFilter | score_document | float | Quality score between 0 and 1, where higher means better quality
FastTextQualityFilter | keep_document | bool | True if Pareto-sampled threshold exceeds 1 - score
FastTextLangId | score_document | str (repr of list) | String representation of [confidence_score, language_code]
FastTextLangId | keep_document | bool | True if confidence score >= min_langid_score cutoff

Usage Examples

Quality Filtering

from nemo_curator.stages.text.filters.fasttext_filter import FastTextQualityFilter

# Create quality filter with a FastText model
quality_filter = FastTextQualityFilter(
    model_path="/models/fasttext_quality.bin",
    label="__label__hq",
    alpha=3,
    seed=42,
)

# Lazy model loading (typically called by the stage setup)
quality_filter.model_check_or_download()
quality_filter.load_model()

# Score and filter a document
score = quality_filter.score_document("This is a well-written document about science.")
keep = quality_filter.keep_document(score)

Language Identification

from nemo_curator.stages.text.filters.fasttext_filter import FastTextLangId

# Create language ID filter
lang_filter = FastTextLangId(
    model_path="/models/lid.176.bin",
    min_langid_score=0.3,
)

# Lazy model loading
lang_filter.model_check_or_download()
lang_filter.load_model()

# Score and filter a document
score = lang_filter.score_document("This is an English document.")
keep = lang_filter.keep_document(score)

Filter Details

FastTextQualityFilter Stochastic Thresholding

The quality filter uses a Pareto distribution for its keep/discard decision rather than a fixed threshold. The decision rule is:

np.random.pareto(self._alpha) > 1 - score

This means:

  • Documents with score close to 1.0 (high quality) are almost always kept, since 1 - score is near 0
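  • Documents with score close to 0.0 (low quality) are kept much less often, since 1 - score is near 1 and the draw must exceed it; with the default alpha of 3, a np.random.pareto draw exceeds 1 about 12.5% of the time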
  • The alpha parameter controls the distribution shape: higher alpha makes the threshold more aggressive (fewer low-quality documents pass)
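The effect of alpha can be made concrete. np.random.pareto draws from the Pareto II (Lomax) distribution, whose survival function is P(X > t) = (1 + t)^(-alpha), so the decision rule keeps a document with probability (2 - score)^(-alpha). The sketch below checks that closed form against a simulation using only the standard library; keep_probability and simulate_keep_rate are hypothetical helpers, and random.paretovariate(alpha) - 1 is used as an assumed equivalent of a np.random.pareto(alpha) draw.

```python
import random


def keep_probability(score: float, alpha: float = 3.0) -> float:
    """Closed-form P(draw > 1 - score) for a Lomax(alpha) draw."""
    return (2.0 - score) ** (-alpha)


def simulate_keep_rate(score: float, alpha: float = 3.0,
                       trials: int = 50_000, seed: int = 42) -> float:
    """Estimate the keep rate by sampling the decision rule directly."""
    rng = random.Random(seed)
    threshold = 1.0 - score
    # paretovariate returns a classical Pareto sample (>= 1); subtracting 1
    # shifts it to the Lomax form that starts at 0.
    kept = sum((rng.paretovariate(alpha) - 1.0) > threshold for _ in range(trials))
    return kept / trials


for s in (0.0, 0.5, 0.9, 1.0):
    print(f"score={s:.1f}  closed-form={keep_probability(s):.3f}")

print(keep_probability(0.0, alpha=3))  # 0.125
print(simulate_keep_rate(0.0, alpha=3))
```

Under this rule the keep probability never drops below 2^(-alpha), so raising alpha is the main lever for reducing how many low-scoring documents pass.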

FastTextLangId Score Format

The score_document method returns a string representation of a Python list containing the confidence score and the two-character uppercase language code:

# Example return value
"[0.95, 'EN']"

The keep_document method parses this string back into a list and checks if the confidence score meets the minimum threshold. The string format is used to allow backend conversions between different data processing frameworks.
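The round trip between the list and its string form can be sketched with ast.literal_eval, which safely parses Python literals. This is an illustrative sketch, not the module's code: encode_langid_score and keep_by_confidence are hypothetical names, and the use of ast.literal_eval for parsing is an assumption.

```python
import ast


def encode_langid_score(confidence: float, lang_code: str) -> str:
    """Serialize [confidence, lang_code] as a string, e.g. "[0.95, 'EN']"."""
    return str([confidence, lang_code])


def keep_by_confidence(score, min_langid_score: float = 0.3) -> bool:
    """Accept either the string form or the list form of the score."""
    if isinstance(score, str):
        score = ast.literal_eval(score)  # parses literals only, never executes code
    return score[0] >= min_langid_score


encoded = encode_langid_score(0.95, "EN")
print(encoded)                       # [0.95, 'EN']
print(keep_by_confidence(encoded))   # True
```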
