Implementation:Huggingface Datatrove FastTextClassifierFilter

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Text Classification, Text Filtering
Last Updated	2026-02-14 17:00 GMT

Overview

FastTextClassifierFilter is a document filter that uses a FastText classification model to keep or remove documents (or sub-document spans) based on predicted label scores exceeding configurable thresholds.

Description

FastTextClassifierFilter extends BaseFilter to provide machine-learning-based document filtering using Facebook's FastText library. It loads a pre-trained FastText classifier model (either from a URL or a local path) and uses it to predict label scores for each document. The filter supports two mutually exclusive modes: keep_labels mode, which retains documents that have at least one specified label above a minimum score threshold, and remove_labels mode, which drops documents that have any specified label above the threshold.
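
The two modes can be illustrated with a minimal sketch. This is not the library's code: `passes_label_check` is a hypothetical helper, the score dictionary stands in for FastText output, and treating a score equal to the threshold as a match (`>=`) is an assumption.

```python
from typing import Dict, List, Optional, Tuple

LabelThresholds = List[Tuple[str, float]]


def passes_label_check(
    scores: Dict[str, float],
    keep_labels: Optional[LabelThresholds] = None,
    remove_labels: Optional[LabelThresholds] = None,
) -> bool:
    """Return True if a text with these label scores should be kept."""
    if keep_labels and remove_labels:
        raise ValueError("supply only one of keep_labels or remove_labels")

    def reaches(label: str, min_score: float) -> bool:
        # Configured names omit the "__label__" prefix used in model output.
        return scores.get(f"__label__{label}", float("-inf")) >= min_score

    if keep_labels:
        # keep mode: at least one listed label must reach its threshold
        return any(reaches(label, t) for label, t in keep_labels)
    # remove mode: no listed label may reach its threshold
    return not any(reaches(label, t) for label, t in (remove_labels or []))
```

With neither argument supplied, this sketch keeps everything; whether the real filter behaves the same way in that case is not stated here.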

The filter_mode parameter controls the granularity of classification: documents can be filtered at the DOCUMENT level (whole text), the PARAGRAPH level, or the SENTENCE level. At sub-document granularity, the filter predicts scores for each span independently, keeps only the spans that pass, and rebuilds the document text from the kept spans. This allows fine-grained filtering in which only the offending paragraphs or sentences are removed and the rest of the document is retained.
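
The span-level flow can be sketched as follows. Everything here is illustrative: `split_lines`, `filter_spans`, `toy_spam_scorer`, and `not_spam` are hypothetical names, the line-based splitter is a stand-in for Datatrove's own paragraph and sentence splitting, and the keyword scorer replaces a real FastText model.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]


def split_lines(text: str) -> List[str]:
    """Toy splitter: one span per line, line endings preserved."""
    return text.splitlines(keepends=True)


def filter_spans(
    text: str,
    score_fn: Callable[[str], Scores],
    is_kept: Callable[[Scores], bool],
    newline_replacement: str = "",
) -> str:
    """Score each span on its own and rebuild the text from the kept spans."""
    kept = []
    for span in split_lines(text):
        # Newlines are replaced before prediction, mirroring newline_replacement.
        model_input = span.strip().replace("\n", newline_replacement)
        if is_kept(score_fn(model_input)):
            kept.append(span)  # keep the original span, not the cleaned input
    return "".join(kept)


def toy_spam_scorer(span: str) -> Scores:
    """Stand-in for a classifier: flags spans containing 'buy now'."""
    return {"__label__spam": 1.0 if "buy now" in span.lower() else 0.0}


def not_spam(scores: Scores) -> bool:
    return scores.get("__label__spam", 0.0) < 0.8


text = "Intro line.\nBuy now!!!\nClosing line.\n"
cleaned = filter_spans(text, toy_spam_scorer, not_spam)
```

Here `cleaned` contains the first and last lines only; the flagged middle line is dropped while the rest of the text is left untouched.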

The filter also supports metadata annotation: when save_labels_in_metadata is enabled (the default), the average score for each predicted label across all spans is stored in the document's metadata dictionary. The FastText model is lazily loaded on first use via a property accessor, and the model file is cached locally using Datatrove's cached_asset_path_or_download utility. Label names in configuration omit the __label__ prefix that FastText uses internally.
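
A rough sketch of the two mechanisms described above, lazy loading through a property and per-label score averaging. `LazyModelHolder` and `average_label_scores` are hypothetical names, the loader is any zero-argument callable supplied by the caller, and keeping the labels exactly as the model returns them as metadata keys is an assumption.

```python
from collections import defaultdict
from statistics import fmean
from typing import Any, Callable, Dict, Iterable, Optional


class LazyModelHolder:
    """Defers an expensive load until the model is first accessed."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self._model: Optional[Any] = None

    @property
    def model(self) -> Any:
        if self._model is None:
            # The first access triggers the load; later accesses reuse the result.
            self._model = self._loader()
        return self._model


def average_label_scores(per_span_scores: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Average each label's score over all spans in which it was predicted."""
    collected = defaultdict(list)
    for scores in per_span_scores:
        for label, score in scores.items():
            collected[label].append(score)
    return {label: fmean(values) for label, values in collected.items()}
```

The resulting dictionary could then be merged into a document's metadata, for example with `doc.metadata.update(...)`.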

Usage

Use FastTextClassifierFilter when you need to filter documents based on a trained text classifier, such as filtering by language, topic, quality, or content category. It is especially useful for large-scale pipelines where a lightweight FastText model can efficiently classify millions of documents.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/filters/fasttext_filter.py
Lines: 1-112

Signature

class FastTextClassifierFilter(BaseFilter):
    name = "🤖 fastText"
    _requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]

    def __init__(
        self,
        model_url: str,
        keep_labels: Tuple[str, float] | list[Tuple[str, float]] | None = None,
        remove_labels: Tuple[str, float] | list[Tuple[str, float]] | None = None,
        save_labels_in_metadata: bool = True,
        exclusion_writer: DiskWriter | None = None,
        newline_replacement="",
        filter_mode: str = SPLIT_TEXT_DOCUMENTS,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.fasttext_filter import FastTextClassifierFilter

I/O Contract

Inputs

Name	Type	Required	Description
model_url	str	Yes	URL to download the FastText model from, or a local file path
keep_labels	Tuple[str, float] or list thereof	No	Labels and minimum scores to keep (mutually exclusive with remove_labels)
remove_labels	Tuple[str, float] or list thereof	No	Labels and minimum scores to remove (mutually exclusive with keep_labels)
save_labels_in_metadata	bool	No	Whether to save average label scores in document metadata (default: True)
exclusion_writer	DiskWriter	No	Optional writer for saving dropped documents
newline_replacement	str	No	String to replace newlines with before prediction (default: empty string)
filter_mode	str	No	Granularity of filtering: DOCUMENT, PARAGRAPH, or SENTENCE (default: SPLIT_TEXT_DOCUMENTS, i.e. whole-document filtering)
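
Because keep_labels and remove_labels each accept either a single (label, score) tuple or a list of such tuples, and the two are mutually exclusive, the arguments need some validation and normalization. The following is one hedged way to do that; `normalize_labels` and `validate_modes` are hypothetical helpers, not part of Datatrove.

```python
from typing import List, Tuple, Union

LabelSpec = Union[Tuple[str, float], List[Tuple[str, float]], None]


def normalize_labels(spec: LabelSpec) -> List[Tuple[str, float]]:
    """Turn a single (label, score) tuple or a list of them into a list."""
    if not spec:
        return []
    if isinstance(spec[0], str):
        # A bare tuple such as ("math", 0.9): wrap it in a list.
        return [(spec[0], float(spec[1]))]
    return [(label, float(score)) for label, score in spec]


def validate_modes(keep_labels: LabelSpec, remove_labels: LabelSpec):
    """Reject configurations that set both modes, then normalize each one."""
    if keep_labels and remove_labels:
        raise ValueError("supply only one of keep_labels or remove_labels")
    return normalize_labels(keep_labels), normalize_labels(remove_labels)
```

For instance, `validate_modes(("math", 0.9), None)` yields a one-element keep list and an empty remove list.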

Outputs

Name	Type	Description
data	DocumentsPipeline (generator)	Yields documents that pass the label score filter, with text potentially trimmed to kept spans

Usage Examples

Basic Usage

from datatrove.pipeline.filters.fasttext_filter import FastTextClassifierFilter

# Keep only documents classified as "math" with at least 0.9 confidence
math_filter = FastTextClassifierFilter(
    model_url="https://example.com/my_fasttext_model.bin",
    keep_labels=[("math", 0.9)],
)

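# Remove paragraphs classified as "spam" with at least 0.8 confidence;
# the remaining paragraphs are kept as the document text.
spam_filter = FastTextClassifierFilter(
    model_url="/path/to/local/model.bin",
    remove_labels=[("spam", 0.8)],
    filter_mode="PARAGRAPH",  # uppercase, matching the documented mode values
)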
