Implementation:Deepset ai Haystack DocumentLanguageClassifier

From Leeroopedia

Overview

DocumentLanguageClassifier is a Haystack component that detects the natural language of each document's content and adds a language field to the document's metadata. It uses the langdetect library for statistical language identification and supports routing to language-specific processing branches when combined with MetadataRouter.

Code Reference

Source file: haystack/components/classifiers/document_language_classifier.py, lines 17-110

Import:

from haystack.components.classifiers import DocumentLanguageClassifier

Dependencies: langdetect (install via pip install langdetect)

Constructor

DocumentLanguageClassifier(
    languages: list[str] | None = None
)

Parameters:

  • languages (list[str] | None, default None): A list of ISO 639-1 language codes to match against (e.g., ["en", "de", "fr"]). If not specified, defaults to ["en"]. Documents whose detected language is not in this list receive a metadata value of "unmatched".
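The default and the "unmatched" fallback described above can be sketched in plain Python. This is an illustrative stand-in, not the component's source: `resolve_languages`, `label_for`, and `UNMATCHED` are hypothetical names, and the fallback to ["en"] is assumed to behave as the parameter description states.

```python
from typing import Optional

UNMATCHED = "unmatched"  # label for documents outside the configured list


def resolve_languages(languages: Optional[list[str]] = None) -> list[str]:
    """Fall back to English when no language list is given (assumed default)."""
    return list(languages) if languages else ["en"]


def label_for(detected: Optional[str], languages: Optional[list[str]] = None) -> str:
    """Return the detected ISO 639-1 code if it is configured, else 'unmatched'."""
    allowed = resolve_languages(languages)
    return detected if detected in allowed else UNMATCHED


print(label_for("de", ["en", "de", "fr"]))  # de
print(label_for("es"))                      # unmatched
```

Passing a detected code of `None` (for example after a failed detection) also yields "unmatched" in this sketch.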

Run Method

run(documents: list[Document]) -> {"documents": list[Document]}

Parameters:

  • documents (list[Document], required): A list of documents for language classification.

Raises:

  • TypeError: If input is not a list of Document objects.

Behavior:

  • Each document receives a new language metadata field.
  • If the detected language matches one of the configured languages, the metadata is set to the ISO code (e.g., "en").
  • If the detected language does not match, the metadata is set to "unmatched".
  • If detection fails, a warning is logged and the document is marked as "unmatched".
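The behavior listed above can be approximated with a short, dependency-free sketch. Everything here is an assumption made for illustration: `Doc` is a hypothetical stand-in for haystack's Document, `classify` mimics the described run logic rather than reproducing it, and `toy_detect` replaces langdetect with a trivial rule so the example runs without extra packages.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Optional

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document (content plus a meta dict)."""
    content: str
    meta: dict = field(default_factory=dict)


def classify(documents, detect: Callable[[str], str],
             languages: Optional[list[str]] = None) -> dict:
    """Label each document with a configured language code or 'unmatched'."""
    if not isinstance(documents, list) or any(not isinstance(d, Doc) for d in documents):
        raise TypeError("classify expects a list of Doc objects")
    allowed = languages or ["en"]
    for doc in documents:
        try:
            detected = detect(doc.content)
        except Exception:
            # Detection failed: log a warning and treat the document as unmatched.
            logger.warning("Language detection failed for %r", doc.content)
            detected = None
        doc.meta["language"] = detected if detected in allowed else "unmatched"
    return {"documents": documents}


def toy_detect(text: str) -> str:
    """Toy detector for the demo only; a real setup would call langdetect."""
    if not text.strip():
        raise ValueError("no features in text")
    return "de" if "ist" in text.split() else "en"


docs = [Doc("This is English"), Doc("Dies ist Deutsch"), Doc("")]
result = classify(docs, toy_detect, languages=["en"])
labels = [d.meta["language"] for d in result["documents"]]
print(labels)  # ['en', 'unmatched', 'unmatched']
```

The empty document takes the failure path, so it ends up with the same "unmatched" label as the German one.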

I/O Contract

Direction | Name | Type | Description
Input | documents | list[Document] | Documents to classify by language
Output | documents | list[Document] | Documents with language metadata field added

Usage Examples

Basic Language Detection

from haystack import Document
from haystack.components.classifiers import DocumentLanguageClassifier

classifier = DocumentLanguageClassifier(languages=["en", "de", "fr"])

docs = [
    Document(content="This is an English document"),
    Document(content="Dies ist ein deutsches Dokument"),
    Document(content="Ceci est un document français")
]

result = classifier.run(documents=docs)
for doc in result["documents"]:
    print(doc.meta["language"])
# Expected output, one label per line: en, de, fr
# (langdetect can give unstable results on very short texts)

Default English-Only Classification

from haystack import Document
from haystack.components.classifiers import DocumentLanguageClassifier

classifier = DocumentLanguageClassifier()  # defaults to ["en"]

docs = [
    Document(content="Hello world"),
    Document(content="Hola mundo")
]

result = classifier.run(documents=docs)
# Expected labels (very short texts like these may be misdetected):
# First doc:  doc.meta["language"] == "en"
# Second doc: doc.meta["language"] == "unmatched"

Pipeline with Language-Based Routing

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en español")
]

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component(
    "language_classifier",
    DocumentLanguageClassifier(languages=["en"])
)
pipeline.add_component(
    "router",
    MetadataRouter(rules={
        "en": {
            "field": "meta.language",
            "operator": "==",
            "value": "en"
        }
    })
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("language_classifier.documents", "router.documents")
pipeline.connect("router.en", "writer.documents")

pipeline.run({"language_classifier": {"documents": docs}})

# Only English documents are written to the store
written_docs = document_store.filter_documents()
assert len(written_docs) == 1
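
To make the routing step in the pipeline above more concrete, the following plain-Python sketch partitions already-classified documents by their language label. It is only an approximation of the idea: `route_by_language` is a hypothetical helper, documents are represented as plain dicts, and none of this is MetadataRouter's actual API.

```python
from collections import defaultdict


def route_by_language(docs: list[dict], branches: list[str]) -> dict[str, list[dict]]:
    """Group documents by meta['language']; anything else goes to 'unmatched'."""
    routed = defaultdict(list)
    for doc in docs:
        label = doc["meta"].get("language", "unmatched")
        routed[label if label in branches else "unmatched"].append(doc)
    return dict(routed)


# Hypothetical classifier output: one English document, one unmatched document.
classified = [
    {"id": "1", "meta": {"language": "en"}},
    {"id": "2", "meta": {"language": "unmatched"}},
]

routed = route_by_language(classified, branches=["en"])
print([d["id"] for d in routed["en"]])         # ['1']
print([d["id"] for d in routed["unmatched"]])  # ['2']
```

In the pipeline, only the "en" branch is connected to the writer, which is why a single document reaches the store.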

Related Pages

Implements Principle

Requires Environment
