Implementation:Deepset ai Haystack DocumentLanguageClassifier
Overview
DocumentLanguageClassifier is a Haystack component that detects the natural language of each document's content and adds a language field to the document's metadata. It uses the langdetect library for statistical language identification and supports routing to language-specific processing branches when combined with MetadataRouter.
Code Reference
Source file: haystack/components/classifiers/document_language_classifier.py, lines 17-110
Import:
from haystack.components.classifiers import DocumentLanguageClassifier
Dependencies: langdetect (install via pip install langdetect)
Constructor
DocumentLanguageClassifier(
languages: list[str] | None = None
)
Parameters:
languages (list[str] | None, default None): A list of ISO 639-1 language codes to match against (e.g., ["en", "de", "fr"]). If not specified, defaults to ["en"]. Documents whose detected language is not in this list receive a metadata value of "unmatched".
Run Method
run(documents: list[Document]) -> {"documents": list[Document]}
Parameters:
documents (list[Document], required): A list of documents for language classification.
Raises:
TypeError: If input is not a list of Document objects.
Behavior:
- Each document receives a new language metadata field.
- If the detected language matches one of the configured languages, the metadata is set to the ISO code (e.g., "en").
- If the detected language does not match, the metadata is set to "unmatched".
- If detection fails, a warning is logged and the document is marked as "unmatched".
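The rules above can be approximated in plain Python. This is a minimal sketch, not the Haystack implementation: Doc, toy_detect, and classify are hypothetical stand-ins, and toy_detect replaces langdetect with a trivial keyword lookup so the example runs without extra dependencies.

```python
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document."""
    content: Optional[str]
    meta: dict = field(default_factory=dict)


def toy_detect(text: str) -> str:
    """Hypothetical detector: a keyword lookup standing in for langdetect."""
    if not text.strip():
        raise ValueError("no text to analyze")
    words = set(text.lower().split())
    if words & {"the", "is", "this"}:
        return "en"
    if words & {"dies", "ist", "ein"}:
        return "de"
    return "xx"  # placeholder code for "some other language"


def classify(documents, languages=None, detect=toy_detect):
    """Assumed logic: label each document with a matched code or "unmatched"."""
    if not isinstance(documents, list) or any(not isinstance(d, Doc) for d in documents):
        raise TypeError("expected a list of Doc objects")
    allowed = languages or ["en"]  # same default as the component
    for doc in documents:
        try:
            detected = detect(doc.content or "")
        except ValueError:
            # Detection failed: log a warning and fall through to "unmatched".
            logger.warning("Language detection failed for a document")
            detected = None
        doc.meta["language"] = detected if detected in allowed else "unmatched"
    return {"documents": documents}


docs = [
    Doc("This is an English document"),
    Doc("Dies ist ein deutsches Dokument"),
    Doc(""),
]
labels = [d.meta["language"] for d in classify(docs, languages=["en"])["documents"]]
print(labels)  # ['en', 'unmatched', 'unmatched']
```

The German document is detected correctly by the toy detector but still ends up as "unmatched" because "de" is not in the configured list, which mirrors the second and third rules above.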
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | documents | list[Document] | Documents to classify by language |
| Output | documents | list[Document] | Documents with language metadata field added |
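One way to read this contract is as a post-condition: after classification, every document carries a language value that is either one of the configured codes or "unmatched". The sketch below checks that condition; check_language_contract is a hypothetical helper, and SimpleNamespace objects stand in for Haystack documents so the example has no dependencies.

```python
from types import SimpleNamespace


def check_language_contract(documents, languages=None):
    """Return (index, label) pairs for documents that break the assumed contract."""
    allowed = set(languages or ["en"]) | {"unmatched"}
    violations = []
    for index, doc in enumerate(documents):
        label = doc.meta.get("language")  # None if the field is missing
        if label not in allowed:
            violations.append((index, label))
    return violations


# Stand-ins for classifier output; the last one is missing the field.
classified = [
    SimpleNamespace(meta={"language": "en"}),
    SimpleNamespace(meta={"language": "unmatched"}),
    SimpleNamespace(meta={}),
]
problems = check_language_contract(classified, languages=["en", "de"])
print(problems)  # [(2, None)]
```

A check like this could run in a test suite right after the classifier to catch documents that bypassed it.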
Usage Examples
Basic Language Detection
from haystack import Document
from haystack.components.classifiers import DocumentLanguageClassifier
classifier = DocumentLanguageClassifier(languages=["en", "de", "fr"])
docs = [
    Document(content="This is an English document"),
    Document(content="Dies ist ein deutsches Dokument"),
    Document(content="Ceci est un document francais")
]
result = classifier.run(documents=docs)
for doc in result["documents"]:
    print(doc.meta["language"])
# Expected labels, one per line: en, de, fr
# (langdetect is statistical, so very short texts can be misclassified)
Default English-Only Classification
from haystack import Document
from haystack.components.classifiers import DocumentLanguageClassifier
classifier = DocumentLanguageClassifier() # defaults to ["en"]
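docs = [
    Document(content="Hello world"),
    Document(content="Hola mundo")
]
result = classifier.run(documents=docs)
# Expected: first doc has meta["language"] == "en"
# Expected: second doc has meta["language"] == "unmatched" (Spanish is not in the configured list)
# Texts this short give the detector little signal, so results may vary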
Pipeline with Language-Based Routing
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter
docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en espanol")
]
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component(
    "language_classifier",
    DocumentLanguageClassifier(languages=["en"])
)
pipeline.add_component(
    "router",
    MetadataRouter(rules={
        "en": {
            "field": "meta.language",
            "operator": "==",
            "value": "en"
        }
    })
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("language_classifier.documents", "router.documents")
pipeline.connect("router.en", "writer.documents")
pipeline.run({"language_classifier": {"documents": docs}})
# Only English documents are written to the store
written_docs = document_store.filter_documents()
assert len(written_docs) == 1
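The routing step in this pipeline can be pictured with a dependency-free sketch. It is not MetadataRouter itself: route_by_language is a hypothetical helper, documents are plain dicts, and each rule is simplified to an exact match on the language label. Documents that match no rule are collected under "unmatched", which in the pipeline above is an output left unconnected.

```python
def route_by_language(documents, rules):
    """Group documents by output name; rules maps output name -> language code."""
    routed = {name: [] for name in rules}
    routed["unmatched"] = []
    for doc in documents:
        language = doc.get("meta", {}).get("language")
        targets = [name for name, code in rules.items() if language == code]
        for name in targets:
            routed[name].append(doc)
        if not targets:
            # No rule matched: the document goes to the fallback output.
            routed["unmatched"].append(doc)
    return routed


# Stand-ins for documents that were already labeled by the classifier.
labeled = [
    {"id": "1", "meta": {"language": "en"}},
    {"id": "2", "meta": {"language": "unmatched"}},
]
routed = route_by_language(labeled, rules={"en": "en"})
print([d["id"] for d in routed["en"]])         # ['1']
print([d["id"] for d in routed["unmatched"]])  # ['2']
```

Only the "en" group would reach the writer, which is why the store in the pipeline example ends up with a single document.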
Related Pages
Implements Principle
- Deepset_ai_Haystack_Document_Language_Classification - The principle behind document language classification
- Deepset_ai_Haystack_MetadataRouter - Routes documents based on metadata (including language)
- Deepset_ai_Haystack_Metadata_Based_Routing - Principle of metadata-based routing
- Deepset_ai_Haystack_DocumentCleaner - Cleans documents before or after classification
Requires Environment