Implementation: Huggingface Datatrove LanguageFilter
| Knowledge Sources | |
|---|---|
| Domains | Language_Identification, NLP, Data_Filtering |
| Type | Filter Module |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete filter class that identifies the natural language of each document using a FastText-based language identification model and removes documents that do not match specified target languages or fall below a confidence threshold.
Description
The LanguageFilter class extends BaseFilter and wraps two FastText-based language identification backends: FT176LID (176-language model) and GlotLID (broader coverage with script detection). On each call to filter(), the model predicts language scores for the document text, annotates the document metadata with language and language_score, and returns whether the document should be kept.
Operational modes:
- Filter mode (default, label_only=False) -- Documents are rejected if no target language exceeds the threshold (when languages are specified) or if the top language score is below the threshold (when languages is None).
- Label-only mode (label_only=True) -- All documents are kept, but language metadata is annotated. Useful for analysis pipelines.
GlotLID-specific behavior: When using the glotlid backend, the predicted label has the format lang_script (e.g., eng_Latn). The filter splits this into separate language and language_script metadata fields.
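The label split can be sketched as follows; `split_glotlid_label` is a hypothetical helper mirroring what the filter does internally for the glotlid backend.

```python
# Hypothetical helper mirroring the split described above; the real
# filter performs this internally for the glotlid backend.
def split_glotlid_label(label: str) -> tuple[str, str]:
    # GlotLID labels have the form "<lang>_<script>", e.g. "eng_Latn"
    language, script = label.rsplit("_", 1)
    return language, script
```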
Top-pairs tracking: When keep_top_pairs_threshold is set to a non-negative value, all language predictions with scores above that threshold are stored in metadata as top_language_{lang}_score keys.
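The resulting metadata keys can be sketched like this (a hypothetical illustration of the annotation scheme described above; `top_pair_metadata` is not a datatrove function):

```python
# Hypothetical sketch of the top-pairs annotation described above;
# `top_pair_metadata` is an illustrative name, not datatrove API.
def top_pair_metadata(predictions: dict[str, float],
                      keep_top_pairs_threshold: float) -> dict[str, float]:
    if keep_top_pairs_threshold < 0:
        return {}  # -1 (the default) disables tracking
    return {
        f"top_language_{lang}_score": score
        for lang, score in predictions.items()
        if score > keep_top_pairs_threshold
    }
```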
Usage
Use LanguageFilter after text extraction to restrict a datatrove pipeline to specific target languages, or in label-only mode to annotate documents with language metadata for downstream analysis.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/language_filter.py
- Lines: 9-65
Signature
class LanguageFilter(BaseFilter):
    name = "Language ID"
    _requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]

    def __init__(
        self,
        languages: list[str] | str | None = None,
        language_threshold: float = 0.65,
        exclusion_writer: DiskWriter = None,
        backend: Literal["ft176", "glotlid"] = "ft176",
        label_only: bool = False,
        keep_top_pairs_threshold: float = -1,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...
Import
from datatrove.pipeline.filters import LanguageFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| languages | list[str] \| str \| None | No (default: None) | List of language codes to keep (e.g., ["en", "fr"]). A single string is auto-wrapped in a list. None accepts all languages above the threshold. |
| language_threshold | float | No (default: 0.65) | Minimum confidence score to accept a document |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected documents |
| backend | Literal["ft176", "glotlid"] | No (default: "ft176") | Language identification model backend |
| label_only | bool | No (default: False) | If True, annotate language metadata without removing any documents |
| keep_top_pairs_threshold | float | No (default: -1) | Store all language predictions above this score in metadata. Set to -1 to disable. |
Pipeline Input: A Document object with plain text in its .text field.
Outputs
| Name | Type | Description |
|---|---|---|
| return value | bool | True if the document passes language criteria and should be kept |
Metadata annotations added to each document:
| Key | Type | Description |
|---|---|---|
| language | str | Predicted language code (e.g., "en", "fr") |
| language_score | float | Confidence score for the predicted language |
| language_script | str | Script identifier (only when backend="glotlid", e.g., "Latn") |
| top_language_{lang}_score | float | Score for each language above keep_top_pairs_threshold (when enabled) |
Usage Examples
Filter to English Only
from datatrove.pipeline.filters import LanguageFilter
lang_filter = LanguageFilter(
    languages=["en"],
    language_threshold=0.65,
)
Multi-Language Filter with GlotLID
from datatrove.pipeline.filters import LanguageFilter
lang_filter = LanguageFilter(
    languages=["en", "fr", "de", "es"],
    language_threshold=0.5,
    backend="glotlid",
)
Label-Only Mode for Analysis
from datatrove.pipeline.filters import LanguageFilter
# Annotate all documents with language info without removing any
lang_annotator = LanguageFilter(
    label_only=True,
    keep_top_pairs_threshold=0.1,
)
Full Pipeline Example
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=100,
)
pipeline.run()
Related Pages
- Huggingface_Datatrove_Language_Filtering (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_Trafilatura (upstream step) -- HTML text extraction that produces the plain text input
- Huggingface_Datatrove_URLFilter (upstream filter) -- URL-based filtering applied before language filtering
- Huggingface_Datatrove_SamplerFilter (downstream filter) -- Random sampling that may follow language filtering