Implementation:Vespa engine Vespa SimpleDetector Detect

From Leeroopedia
Revision as of 17:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Vespa_engine_Vespa_SimpleDetector_Detect.md)


Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by Vespa's linguistics library, for detecting the natural language of input text. It uses Unicode block analysis to identify CJK languages, and its detect method accepts a hint argument alongside the text.

Description

The SimpleDetector class implements the Detector interface and provides a lightweight, heuristic language detection mechanism. Rather than building statistical n-gram profiles, it examines the Unicode block of each character in the input to decide whether the text is in one of the CJK (Chinese, Japanese, Korean) languages; otherwise it falls back to an unknown or hint-supplied language.

The detect method delegates to an internal guessLanguage method that iterates through the characters of the input string, examines which Unicode block each character belongs to, and returns the most likely language based on the presence of script-specific characters (e.g., Hiragana/Katakana for Japanese, Hangul for Korean, CJK Unified Ideographs for Chinese).
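The block-by-block scan described above can be sketched with the JDK's Character.UnicodeBlock alone. This is a minimal illustration, not the Vespa source: the Lang enum, the class name, the exact set of blocks checked, and the rule that kana or Hangul decide immediately while ideographs alone fall through to Chinese are all assumptions made for the example.

```java
import java.lang.Character.UnicodeBlock;

class UnicodeBlockGuessSketch {

    // Hypothetical stand-in for com.yahoo.language.Language; not the Vespa enum.
    enum Lang { JAPANESE, KOREAN, CHINESE, UNKNOWN }

    // Assumed rule set: kana means Japanese, Hangul means Korean,
    // ideographs with neither are treated as Chinese.
    static Lang guess(String input) {
        if (input == null || input.isEmpty()) {
            return Lang.UNKNOWN;
        }
        boolean sawIdeograph = false;
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            UnicodeBlock block = UnicodeBlock.of(codePoint); // may be null for unassigned ranges
            if (block == UnicodeBlock.HIRAGANA || block == UnicodeBlock.KATAKANA) {
                return Lang.JAPANESE;
            }
            if (block == UnicodeBlock.HANGUL_SYLLABLES || block == UnicodeBlock.HANGUL_JAMO) {
                return Lang.KOREAN;
            }
            if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                sawIdeograph = true; // shared by Chinese and Japanese, so keep scanning
            }
        }
        return sawIdeograph ? Lang.CHINESE : Lang.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(guess("漢字かな"));  // kana after ideographs
        System.out.println(guess("데이터"));     // Hangul syllables
        System.out.println(guess("世界"));       // ideographs only
        System.out.println(guess("Hello"));      // no CJK characters
    }
}
```

Ideographs are only recorded, not returned on first sight, because the same characters appear in Japanese text; a later kana character can still change the answer.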

The detection result is wrapped in a Detection object that includes:

  • The detected Language enum value.
  • The character encoding (always UTF-8).
  • A boolean indicating whether the detection is certain (always false for this heuristic detector).
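The three fields listed above can be pictured as a small immutable value type. The sketch below is a hypothetical stand-in written against the JDK only; DetectionSketch, its field names, and heuristicResult are illustrative and do not reproduce the Vespa Detection class. It assumes the charset involved is the JDK's standard UTF-8 charset.

```java
import java.nio.charset.StandardCharsets;

class DetectionSketch {

    // Hypothetical stand-in for com.yahoo.language.detect.Detection.
    final String language;      // name of the guessed language, e.g. "JAPANESE"
    final String encodingName;  // canonical charset name
    final boolean certain;      // certainty flag as described on this page

    DetectionSketch(String language, String encodingName, boolean certain) {
        this.language = language;
        this.encodingName = encodingName;
        this.certain = certain;
    }

    // Follows the shape described on this page: guessed language, UTF-8, false.
    static DetectionSketch heuristicResult(String guessedLanguage) {
        return new DetectionSketch(guessedLanguage, StandardCharsets.UTF_8.name(), false);
    }

    public static void main(String[] args) {
        DetectionSketch d = heuristicResult("JAPANESE");
        System.out.println(d.language + " " + d.encodingName + " " + d.certain);
        // prints "JAPANESE UTF-8 false"
    }
}
```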

Usage

Use SimpleDetector.detect() as the first step in a text processing pipeline when you need to determine the language of incoming text to route it to language-specific processing. This detector is appropriate when:

  • The primary distinction needed is between CJK and non-CJK languages.
  • A lightweight, zero-dependency detection mechanism is required.
  • Statistical accuracy for distinguishing between closely related Latin-script languages is not needed.

For more sophisticated language detection (e.g., distinguishing French from Spanish), a statistical detector should be used instead.

Code Reference

Source Location

  • Repository: Vespa
  • File: linguistics/src/main/java/com/yahoo/language/simple/SimpleDetector.java
  • Lines: 41-43

Signature

@Override
public Detection detect(String input, Hint hint)

Class Declaration

public class SimpleDetector implements Detector

Package

package com.yahoo.language.simple;

Imports

import com.yahoo.language.Language;
import com.yahoo.language.detect.Detection;
import com.yahoo.language.detect.Detector;
import com.yahoo.language.detect.Hint;
import com.yahoo.text.Utf8;

Method Body

@Override
public Detection detect(String input, Hint hint) {
    return new Detection(guessLanguage(input), Utf8.getCharset().name(), false);
}

I/O Contract

Inputs

Name | Type | Required | Description
input | String | Yes | The text to analyze. Can be any length; the detector scans its characters sequentially.
hint | Hint | Yes | Locale or language information for disambiguation. The argument is part of the signature, but the detect method body shown on this page does not pass it to guessLanguage. Pass Hint.NONE if no hint is available.

Outputs

Name | Type | Description
(return value) | Detection | A detection result containing the detected Language enum value, the encoding name (always "UTF-8"), and a certainty flag (always false).

Usage Examples

Basic Usage

import com.yahoo.language.simple.SimpleDetector;
import com.yahoo.language.detect.Detection;
import com.yahoo.language.detect.Hint;

SimpleDetector detector = new SimpleDetector();

// Detect language of English text
Detection result = detector.detect("Hello, world!", Hint.NONE);
// result.getLanguage() -> Language.UNKNOWN (no CJK characters detected)

// Detect language of Japanese text
Detection japaneseResult = detector.detect("こんにちは世界", Hint.NONE);
// japaneseResult.getLanguage() -> Language.JAPANESE (Hiragana detected)

Using with Locale Hints

import com.yahoo.language.simple.SimpleDetector;
import com.yahoo.language.detect.Detection;
import com.yahoo.language.detect.Hint;
import java.util.Locale;

SimpleDetector detector = new SimpleDetector();

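// Provide a locale hint for ambiguous text. The Hint factory call is kept as in
// the original example; confirm it against the Hint class in your Vespa version.
Hint koreanHint = Hint.newInstance(Locale.KOREAN);
// The detect method body shown on this page does not forward the hint to
// guessLanguage, so the result depends on the characters in the text.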
Detection result = detector.detect("mixed content 데이터", koreanHint);
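Because the detect method body shown earlier on this page does not read the hint, a caller that wants a locale to act as a fallback may need to apply it itself. The following is a minimal sketch of that idea using only java.util.Locale; LocaleFallbackSketch and resolveLanguageCode are hypothetical names, and representing the detected language as a plain language-code string is an assumption made to keep the example free of Vespa types.

```java
import java.util.Locale;

class LocaleFallbackSketch {

    // Hypothetical helper, not part of Vespa: returns the detected language code,
    // or the locale's language code when detection produced nothing usable.
    static String resolveLanguageCode(String detectedCode, Locale fallbackLocale) {
        if (detectedCode == null || detectedCode.isEmpty()) {
            return fallbackLocale.getLanguage();
        }
        return detectedCode;
    }

    public static void main(String[] args) {
        System.out.println(resolveLanguageCode("ja", Locale.KOREAN)); // prints "ja"
        System.out.println(resolveLanguageCode("", Locale.KOREAN));   // prints "ko"
    }
}
```

Keeping the fallback on the caller's side makes the precedence explicit: a positive detection wins, and the locale is used only when the heuristic has no answer.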

Integration in a Pipeline

import com.yahoo.language.simple.SimpleDetector;
import com.yahoo.language.detect.Detection;
import com.yahoo.language.detect.Hint;
import com.yahoo.language.Language;

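// Hypothetical wrapper class so the example compiles as a unit
public class DocumentProcessor {

    private final SimpleDetector detector = new SimpleDetector();

    public void processDocument(String text) {
        Detection detection = detector.detect(text, Hint.NONE);
        Language language = detection.getLanguage();

        if (language.isCjk()) {
            // Use CJK-specific tokenizer (character-based segmentation)
            tokenizeCjk(text, language);
        } else {
            // Use standard whitespace-based tokenizer
            tokenizeLatinScript(text, language);
        }
    }

    // Hypothetical placeholders for the application's own tokenizers
    private void tokenizeCjk(String text, Language language) { }

    private void tokenizeLatinScript(String text, Language language) { }
}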

Related Pages

Implements Principle
