Implementation:Deepset ai Haystack DocumentLanguageClassifier

Overview

DocumentLanguageClassifier is a Haystack component that detects the natural language of each document's content and adds a language field to the document's metadata. It uses the langdetect library for statistical language identification and supports routing to language-specific processing branches when combined with MetadataRouter.

Code Reference

Source file: haystack/components/classifiers/document_language_classifier.py, lines 17-110

Import:

from haystack.components.classifiers import DocumentLanguageClassifier

Dependencies: langdetect (install via pip install langdetect)

Constructor

DocumentLanguageClassifier(
    languages: list[str] | None = None
)

Parameters:

languages (list[str] | None, default None): A list of ISO 639-1 language codes to match against (e.g., ["en", "de", "fr"]). If not specified, defaults to ["en"]. Documents whose detected language is not in this list receive a metadata value of "unmatched".

Run Method

run(documents: list[Document]) -> {"documents": list[Document]}

Parameters:

documents (list[Document], required): A list of documents for language classification.

Raises:

TypeError: If input is not a list of Document objects.

Behavior:

Each document receives a new language metadata field.
If the detected language matches one of the configured languages, the metadata is set to the ISO code (e.g., "en").
If the detected language does not match, the metadata is set to "unmatched".
If detection fails, a warning is logged and the document is marked as "unmatched".
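
The rules above can be sketched in plain Python, without Haystack or langdetect installed. This is an illustration only, not the library's source: `SimpleDoc`, `label_documents`, and `toy_detect` are hypothetical names, and the toy detector swaps langdetect's statistical model for a keyword lookup so the example stays self-contained. The sketch also validates every list element, which may be stricter than the real component.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for haystack.Document (content plus meta)."""
    content: str
    meta: dict = field(default_factory=dict)


def label_documents(
    documents: list,
    detect: Callable[[str], Optional[str]],
    languages: Optional[list] = None,
) -> dict:
    """Apply the labelling rules described above to each document."""
    languages = languages or ["en"]  # default described for the constructor
    if not isinstance(documents, list) or any(
        not isinstance(d, SimpleDoc) for d in documents
    ):
        raise TypeError("expected a list of SimpleDoc objects")
    for doc in documents:
        detected = detect(doc.content)  # None models a failed detection
        doc.meta["language"] = detected if detected in languages else "unmatched"
    return {"documents": documents}


def toy_detect(text: str) -> Optional[str]:
    """Keyword lookup used only for this sketch; not a real detector."""
    hints = {"the": "en", "der": "de", "el": "es"}
    for word in text.lower().split():
        if word in hints:
            return hints[word]
    return None


docs = [
    SimpleDoc("The cat sleeps"),   # detected "en", configured -> "en"
    SimpleDoc("Der Hund bellt"),   # detected "de", configured -> "de"
    SimpleDoc("El gato duerme"),   # detected "es", not configured -> "unmatched"
    SimpleDoc("12345"),            # detection fails -> "unmatched"
]
result = label_documents(docs, toy_detect, ["en", "de"])
labels = [d.meta["language"] for d in result["documents"]]
print(labels)  # ['en', 'de', 'unmatched', 'unmatched']
```

Passing the detector in as a callable keeps the matching rule separate from the detection backend, which makes the "unmatched" fallback easy to see and test in isolation.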

I/O Contract

Direction	Name	Type	Description
Input	documents	list[Document]	Documents to classify by language
Output	documents	list[Document]	Documents with `language` metadata field added

Usage Examples

Basic Language Detection

from haystack import Document
from haystack.components.classifiers import DocumentLanguageClassifier

classifier = DocumentLanguageClassifier(languages=["en", "de", "fr"])

docs = [
    Document(content="This is an English document"),
    Document(content="Dies ist ein deutsches Dokument"),
    Document(content="Ceci est un document français")
]

result = classifier.run(documents=docs)
for doc in result["documents"]:
    print(doc.meta["language"])
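# Expected output, one label per line: en, de, fr
# langdetect is probabilistic, so very short texts can occasionally be labeled differently.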

Default English-Only Classification

from haystack import Document
from haystack.components.classifiers import DocumentLanguageClassifier

classifier = DocumentLanguageClassifier()  # defaults to ["en"]

docs = [
    Document(content="Hello world"),
    Document(content="Hola mundo")
]

result = classifier.run(documents=docs)
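# Expected: first doc has doc.meta["language"] == "en"
# Expected: second doc has doc.meta["language"] == "unmatched"
# langdetect is probabilistic, so very short inputs like these may occasionally be labeled differently.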

Pipeline with Language-Based Routing

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en español")
]

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "language_classifier",
    DocumentLanguageClassifier(languages=["en"])
)
pipeline.add_component(
    "router",
    MetadataRouter(rules={
        "en": {
            "field": "meta.language",
            "operator": "==",
            "value": "en"
        }
    })
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("language_classifier.documents", "router.documents")
pipeline.connect("router.en", "writer.documents")

pipeline.run({"language_classifier": {"documents": docs}})

# Only English documents are written to the store
written_docs = document_store.filter_documents()
assert len(written_docs) == 1
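
Conceptually, the routing step in the pipeline above groups documents by their `language` label, and only the `en` group reaches the writer. A minimal stdlib sketch of that grouping follows, assuming documents are plain dicts with a `meta` key. This is a simplification for illustration: `group_by_language` is a hypothetical helper, not Haystack's Document or MetadataRouter.

```python
from collections import defaultdict


def group_by_language(documents: list) -> dict:
    """Group plain-dict documents by meta['language'] (hypothetical helper)."""
    branches = defaultdict(list)
    for doc in documents:
        # In this sketch, documents without a label are treated as unmatched.
        label = doc.get("meta", {}).get("language", "unmatched")
        branches[label].append(doc)
    return dict(branches)


labelled = [
    {"id": "1", "meta": {"language": "en"}},
    {"id": "2", "meta": {"language": "unmatched"}},
]
branches = group_by_language(labelled)
print(sorted(branches))      # ['en', 'unmatched']
print(len(branches["en"]))   # 1
```

In the real pipeline only the `router.en` output is connected to the writer, which is why the store ends up with a single document.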

Related Pages

Implements Principle

Principle:Deepset_ai_Haystack_Document_Language_Classification

Deepset_ai_Haystack_Document_Language_Classification - The principle behind document language classification
Deepset_ai_Haystack_MetadataRouter - Routes documents based on metadata (including language)
Deepset_ai_Haystack_Metadata_Based_Routing - Principle of metadata-based routing
Deepset_ai_Haystack_DocumentCleaner - Cleans documents before or after classification

Requires Environment

Environment:Deepset_ai_Haystack_HuggingFace_Model_Environment
