Principle:Deepset ai Haystack Document Language Classification
Overview
Document Language Classification is the principle of automatically detecting the natural language of document content and annotating each document with its detected language. This enables language-aware processing in multilingual pipelines, where documents in different languages need to be routed to language-specific processing branches (e.g., different embedding models, different prompt templates, or language-specific document stores).
Description
In multilingual document processing systems, knowing the language of each document is essential for correct downstream handling. Language detection classifies the language of a document's text content by analyzing statistical properties of the character and word distributions.
The classification process works as follows:
- Text analysis: The content of each document is analyzed using statistical language detection. The detection algorithm examines character n-gram frequency profiles and compares them against known language profiles.
- Language matching: The detected language is compared against a configured list of expected languages (specified as ISO 639-1 codes such as "en", "de", "fr").
- Metadata annotation: If the detected language matches one of the configured languages, a language field is added to the document's metadata with the ISO code. If no match is found, the metadata value is set to "unmatched".
- Downstream routing: After classification, a metadata-based router can direct documents to language-specific processing branches based on the language metadata field.
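The steps above can be sketched in plain Python. This is a minimal illustration, not the Haystack implementation: the Doc class, annotate_language, and toy_detect are hypothetical names, the detector is a deliberately naive stand-in for a real statistical detector, and tagging a document as "unmatched" when detection raises an error is an assumption of this sketch.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document: text plus metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def toy_detect(text):
    """Naive placeholder detector (assumption): keys on a few stop words."""
    if not text.strip():
        raise ValueError("no features in text")
    padded = f" {text.lower()} "
    if " the " in padded:
        return "en"
    if " der " in padded or " und " in padded:
        return "de"
    return "xx"  # some language outside the configured set


def annotate_language(docs, detect, languages=("en",)):
    """Add meta['language'] to each document without touching its content."""
    for doc in docs:
        try:
            detected = detect(doc.content)
        except Exception as exc:
            # Log and continue; this sketch treats a failure as "unmatched".
            logger.warning("Language detection failed: %s", exc)
            detected = None
        doc.meta["language"] = detected if detected in languages else "unmatched"
    return docs


docs = annotate_language(
    [Doc("the cat sat"), Doc("Hund und Katze"), Doc(""), Doc("bonjour")],
    detect=toy_detect,
    languages=("en", "de"),
)
print([d.meta["language"] for d in docs])
```

Passing the detector as a parameter keeps the matching and annotation logic independent of any particular detection library.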
Key Properties
- Non-destructive: The classifier does not modify document content; it only adds metadata.
- Configurable language set: Documents whose detected language is in the configured set are tagged with that language's ISO code; all other documents are tagged as "unmatched".
- Graceful error handling: If language detection fails (e.g., for very short or ambiguous text), the classifier logs a warning and continues processing.
- Pipeline integration: Designed to work in conjunction with MetadataRouter for language-based routing.
Usage
Document Language Classification is typically used at the beginning of a multilingual processing pipeline, immediately before a metadata-based router. The classifier annotates each document, and the router then directs documents to language-specific branches.
[Documents] --> [DocumentLanguageClassifier] --> [MetadataRouter] --en--> [EnglishProcessor]
                                                                  --de--> [GermanProcessor]
                                                                  --unmatched--> [FallbackProcessor]
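The routing step in the diagram can be approximated with a small grouping function. This is a sketch under stated assumptions, not the MetadataRouter API: documents are represented as plain dicts with a "meta" key, and route_by_language is a hypothetical helper that sends any document without a configured branch to the "unmatched" fallback.

```python
def route_by_language(docs, branches):
    """Group documents by meta['language']; unknown values go to 'unmatched'."""
    routed = {name: [] for name in branches}
    routed.setdefault("unmatched", [])  # fallback branch always exists
    for doc in docs:
        language = doc.get("meta", {}).get("language", "unmatched")
        key = language if language in routed else "unmatched"
        routed[key].append(doc)
    return routed


annotated = [
    {"content": "Hello there", "meta": {"language": "en"}},
    {"content": "Guten Tag", "meta": {"language": "de"}},
    {"content": "Bonjour", "meta": {"language": "unmatched"}},
]
routed = route_by_language(annotated, branches=("en", "de"))
for branch, items in routed.items():
    print(branch, [d["content"] for d in items])
```

Each key of the returned dict corresponds to one outgoing edge of the router in the diagram.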
Theoretical Basis
Language detection relies on statistical language identification techniques. The most common approach uses character n-gram frequency profiles, where the frequency distribution of character sequences (bigrams, trigrams) in a text is compared against reference profiles for each known language. The language whose profile is most similar to the input text (measured by rank-order distance or similar metrics) is selected as the detected language.
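To make the rank-order comparison concrete, here is a toy sketch of n-gram profile matching. It is not the algorithm used by any particular library: the function names, the single-sentence reference samples, and the choice of penalty for unseen n-grams are all assumptions for illustration, and real reference profiles are built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_values=(1, 2, 3), top_k=300):
    """Return the text's character n-grams ordered from most to least frequent."""
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(
        padded[i:i + n]
        for n in n_values
        for i in range(len(padded) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(text_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(text_profile)
    )


def classify(text, profiles):
    """Pick the language whose reference profile is closest to the text's profile."""
    text_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(text_profile, profiles[lang]))


# Tiny illustrative samples (assumption); far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

print(classify(SAMPLES["en"], PROFILES))
```

A lower distance means the text's most frequent n-grams appear at similar ranks in the reference profile, which is the similarity notion described above.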
The langdetect library used in the implementation is based on Nakatani Shuyo's work, which uses a Naive Bayes classifier trained on character n-gram features from Wikipedia text. It supports over 50 languages and achieves high accuracy on texts of moderate length (typically 50+ characters).
Key considerations:
- Short text challenge: Very short texts (a few words) may not contain enough statistical signal for reliable detection.
- Mixed language content: Documents containing multiple languages may be classified as whichever language is most prevalent.
- Script-based disambiguation: Languages using unique scripts (e.g., Chinese, Korean, Arabic) are easier to distinguish than languages sharing the Latin alphabet.
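The script-based point can be illustrated with the Python standard library. This sketch uses the first word of each character's Unicode name as a rough proxy for its script; that is an approximation chosen for brevity, not the official Unicode Script property, and dominant_script is a hypothetical helper name. For mixed content it returns the most prevalent script, mirroring the mixed-language consideration above.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script label among alphabetic characters, or None."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "HANGUL SYLLABLE AN" -> "HANGUL", "LATIN SMALL LETTER A" -> "LATIN"
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


for sample in ["Hello world", "안녕하세요", "مرحبا بالعالم", "中文", "12345 !?"]:
    print(repr(sample), dominant_script(sample))
```

A label such as HANGUL or ARABIC narrows the candidate languages sharply, whereas LATIN still leaves many languages to separate by n-gram statistics.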
Related Pages
- Deepset_ai_Haystack_DocumentLanguageClassifier - Implementation of Document Language Classification in Haystack
- Deepset_ai_Haystack_Metadata_Based_Routing - Routing documents based on metadata fields (including language)
- Deepset_ai_Haystack_MetadataRouter - MetadataRouter component used for language-based routing
- Deepset_ai_Haystack_Document_Cleaning - Cleaning documents before or after classification