Implementation:Huggingface Datatrove FastTextClassifierFilter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Classification, Text Filtering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
FastTextClassifierFilter is a document filter that uses a FastText classification model to keep or remove documents (or sub-document spans) depending on whether predicted label scores reach configurable minimum thresholds.
Description
FastTextClassifierFilter extends BaseFilter to provide machine-learning-based document filtering using Facebook's FastText library. It loads a pre-trained FastText classifier model (either from a URL or a local path) and uses it to predict label scores for each document. The filter supports two mutually exclusive modes: keep_labels mode, which retains documents in which at least one specified label reaches its minimum score, and remove_labels mode, which drops documents in which any specified label reaches its minimum score.
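The two modes can be sketched in plain Python. This is an illustrative stand-in, not the library's code: `passes` is a hypothetical helper, the scores are assumed to arrive as a dict keyed by FastText-style `__label__` names, and thresholds are assumed to be inclusive minimums.

```python
def passes(scores, keep_labels=None, remove_labels=None):
    """Return True if a text unit with these label scores should be kept.

    scores: dict mapping FastText-style names ("__label__math") to floats.
    keep_labels / remove_labels: lists of (label, min_score), without the prefix.
    """
    if keep_labels and remove_labels:
        raise ValueError("keep_labels and remove_labels are mutually exclusive")

    def hit(label, min_score):
        # A label missing from the prediction never reaches the threshold.
        return scores.get(f"__label__{label}", float("-inf")) >= min_score

    if keep_labels:
        # Keep mode: one matching label is enough to retain the unit.
        return any(hit(label, s) for label, s in keep_labels)
    # Remove mode: any matching label drops the unit.
    return not any(hit(label, s) for label, s in (remove_labels or []))


scores = {"__label__math": 0.93, "__label__spam": 0.05}
print(passes(scores, keep_labels=[("math", 0.9)]))
print(passes(scores, remove_labels=[("spam", 0.8)]))
print(passes(scores, remove_labels=[("math", 0.9)]))
```

With neither list supplied, this sketch keeps every unit; whether the real filter behaves the same way is not stated here.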
The filter_mode parameter controls the granularity of classification. Documents can be filtered at the DOCUMENT level (whole text), PARAGRAPH level, or SENTENCE level. When operating at sub-document granularity, the filter predicts scores for each span independently, keeps only the spans that pass, and reconstructs the document text from the kept spans. This enables fine-grained content filtering where only problematic paragraphs or sentences are removed rather than discarding the entire document.
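A minimal sketch of span-level filtering, under simplifying assumptions: paragraphs are approximated by splitting on line breaks, and `toy_score_fn` is a hypothetical stand-in for the FastText model. `filter_spans` is not a Datatrove function; it only illustrates the score, keep, and rejoin sequence.

```python
def filter_spans(text, score_fn, remove_labels, newline_replacement=""):
    """Drop spans whose score for any listed label reaches its threshold."""
    # Simplified paragraph split: each span keeps its trailing newline so that
    # joining the kept spans preserves the original layout.
    spans = text.splitlines(keepends=True)
    kept = []
    for span in spans:
        # FastText predicts one line at a time, so newlines are replaced first.
        scores = score_fn(span.strip().replace("\n", newline_replacement))
        flagged = any(
            scores.get(label, 0.0) >= min_score for label, min_score in remove_labels
        )
        if not flagged:
            kept.append(span)
    # Rebuild the document text from the spans that passed.
    return "".join(kept)


def toy_score_fn(line):
    # Hypothetical classifier stand-in: flags lines that mention "buy now".
    return {"spam": 0.95 if "buy now" in line.lower() else 0.02}


doc = "Intro paragraph.\nBUY NOW, limited offer!\nClosing paragraph.\n"
cleaned = filter_spans(doc, toy_score_fn, remove_labels=[("spam", 0.8)])
print(cleaned)
```

Only the flagged middle span is removed; the surrounding text survives unchanged.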
The filter also supports metadata annotation: when save_labels_in_metadata is enabled (the default), the average score for each predicted label across all spans is stored in the document's metadata dictionary. The FastText model is lazily loaded on first use via a property accessor, and the model file is cached locally using Datatrove's cached_asset_path_or_download utility. Label names in configuration omit the __label__ prefix that FastText uses internally.
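The lazy-loading and score-averaging behaviour described above can be illustrated with a small stand-in class. Everything here is hypothetical (`LazyScorer`, `load_toy_model`, and the metadata key format are assumptions); the real filter loads a FastText model file and writes into the document's metadata dictionary.

```python
from collections import defaultdict
from statistics import mean


class LazyScorer:
    """Hypothetical stand-in showing lazy loading and per-label averaging."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds the (expensive) model
        self._model = None
        self.load_count = 0

    @property
    def model(self):
        # Build the model on first access only; later accesses reuse it.
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        return self._model

    def average_scores(self, spans):
        per_label = defaultdict(list)
        for span in spans:
            for label, score in self.model(span).items():
                per_label[label].append(score)
        # One averaged value per label, the kind of value stored as metadata.
        return {label: mean(values) for label, values in per_label.items()}


def load_toy_model():
    # Stand-in for downloading and loading a FastText model file.
    return lambda span: {"__label__math": 1.0 if "=" in span else 0.0}


scorer = LazyScorer(load_toy_model)
metadata = scorer.average_scores(["x = 1", "hello", "y = 2", "bye"])
print(metadata, scorer.load_count)
```

Deferring the load to first use means constructing the filter stays cheap, and the model is only materialised in the process that actually runs the filter.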
Usage
Use FastTextClassifierFilter when you need to filter documents based on a trained text classifier, such as filtering by language, topic, quality, or content category. It is especially useful for large-scale pipelines where a lightweight FastText model can efficiently classify millions of documents.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/fasttext_filter.py
- Lines: 1-112
Signature
class FastTextClassifierFilter(BaseFilter):
    name = "🤖 fastText"
    _requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]

    def __init__(
        self,
        model_url: str,
        keep_labels: Tuple[str, float] | list[Tuple[str, float]] | None = None,
        remove_labels: Tuple[str, float] | list[Tuple[str, float]] | None = None,
        save_labels_in_metadata: bool = True,
        exclusion_writer: DiskWriter | None = None,
        newline_replacement="",
        filter_mode: str = SPLIT_TEXT_DOCUMENTS,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...
Import
from datatrove.pipeline.filters.fasttext_filter import FastTextClassifierFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_url | str | Yes | URL to download the FastText model from, or a local file path |
| keep_labels | Tuple[str, float] or list thereof | No | Labels and minimum scores to keep (mutually exclusive with remove_labels) |
| remove_labels | Tuple[str, float] or list thereof | No | Labels and minimum scores to remove (mutually exclusive with keep_labels) |
| save_labels_in_metadata | bool | No | Whether to save average label scores in document metadata (default: True) |
| exclusion_writer | DiskWriter | No | Optional writer for saving dropped documents |
| newline_replacement | str | No | String to replace newlines with before prediction (default: empty string) |
| filter_mode | str | No | Granularity of filtering: DOCUMENT, PARAGRAPH, or SENTENCE (default: SPLIT_TEXT_DOCUMENTS, the whole-document mode) |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields documents that pass the label score filter, with text potentially trimmed to kept spans |
Usage Examples
Basic Usage
from datatrove.pipeline.filters.fasttext_filter import FastTextClassifierFilter

# Keep only documents classified as "math" with a score of at least 0.9
math_filter = FastTextClassifierFilter(
    model_url="https://example.com/my_fasttext_model.bin",
    keep_labels=[("math", 0.9)],
)

# Remove paragraphs classified as "spam" with a score of at least 0.8;
# the remaining paragraphs are rejoined into the document text.
# The mode value is upper case, matching the DOCUMENT / PARAGRAPH / SENTENCE names above.
spam_filter = FastTextClassifierFilter(
    model_url="/path/to/local/model.bin",
    remove_labels=[("spam", 0.8)],
    filter_mode="PARAGRAPH",
)