Workflow:Vespa engine Vespa Linguistics text processing pipeline

From Leeroopedia


Knowledge Sources
Domains: NLP, Text_Processing, Search
Last Updated: 2026-02-09 12:00 GMT

Overview

End-to-end process for transforming raw text into indexed tokens and vector embeddings using Vespa's built-in linguistics pipeline.

Description

This workflow outlines how Vespa processes text input through its linguistics subsystem. The pipeline takes raw text and produces tokenized, normalized, and optionally stemmed output suitable for indexing and search. It covers language detection, Unicode normalization, tokenization at character-type boundaries, accent removal, stemming via the KStem algorithm, and optional embedding generation for vector search. The pipeline follows a factory pattern: the Linguistics interface is the thread-safe entry point that hands out the processing components, although individual components such as the tokenizer are not thread-safe themselves (see Step 4).

Usage

Execute this workflow when ingesting text content into Vespa that requires linguistic processing for search relevance. It applies when you have raw text (articles, product descriptions, user queries) and need either normalized tokens for keyword search or dense vector embeddings for semantic search. The pipeline handles over 200 languages, with CJK-aware tokenization and stemming for English.

Execution Steps

Step 1: Language detection

Analyze the input text to determine its language and encoding. The detector examines Unicode character blocks to identify Korean, Japanese, Chinese, and Thai, and falls back to encoding analysis (UTF-8, US-ASCII, ISO-8859-1) for other scripts. The result guides downstream processing decisions such as CJK-specific tokenization and language-appropriate stemming.

Key considerations:

  • Korean is identified by Hangul Unicode blocks (0x3200-0xFFE0)
  • Japanese is identified by Hiragana, Katakana, and Kanbun blocks
  • Chinese is detected via CJK Unified Ideographs blocks
  • For ambiguous input, hints (such as market locale) can guide detection
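The block-based part of detection can be illustrated with a minimal Python sketch. It is not Vespa's detector: the function name `detect_language`, the `hint` parameter, and the language codes are hypothetical, and the code point ranges are the standard Unicode block boundaries rather than values taken from Vespa's source. Encoding analysis for other scripts is not shown.

```python
from typing import Optional

# Inclusive code point ranges of standard Unicode blocks (assumed here for
# illustration; not copied from Vespa's implementation).
BLOCKS = {
    # Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables
    "ko": [(0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7AF)],
    # Hiragana, Katakana, Kanbun
    "ja": [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3190, 0x319F)],
    # Thai
    "th": [(0x0E00, 0x0E7F)],
    # CJK Unified Ideographs
    "zh": [(0x4E00, 0x9FFF)],
}


def _in_blocks(code_point: int, lang: str) -> bool:
    return any(lo <= code_point <= hi for lo, hi in BLOCKS[lang])


def detect_language(text: str, hint: Optional[str] = None) -> Optional[str]:
    """Return a language code, or None when no block-based signal is found."""
    saw_ideograph = False
    for ch in text:
        cp = ord(ch)
        # Script-specific blocks are decisive as soon as they appear.
        for lang in ("ko", "ja", "th"):
            if _in_blocks(cp, lang):
                return lang
        # Ideographs alone are ambiguous (Chinese or Japanese), so keep scanning.
        if _in_blocks(cp, "zh"):
            saw_ideograph = True
    if saw_ideograph:
        # A hint (for example, derived from a market locale) breaks the tie.
        return hint if hint in ("zh", "ja") else "zh"
    return None  # a real detector would fall back to encoding analysis here
```

Ideographs are checked last because Japanese text mixes them with kana; a string made only of ideographs is where a hint is most useful.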

Step 2: Text chunking

For large text inputs that exceed embedding model capacity, split the text into manageable chunks at natural boundaries. The chunker targets a configurable length (default 1000 characters) and prefers to break at punctuation or whitespace. CJK text is split exactly at the target length, since word boundaries are less meaningful in those scripts. The chunker also adjusts the target length dynamically so that the text is distributed evenly across chunks.

Key considerations:

  • Up to a soft limit 5% above the target, the chunker prefers boundaries made of two consecutive non-letter/digit characters
  • Up to a hard limit 10% above the target, it falls back to boundaries made of a single non-letter/digit character
  • CJK languages receive special handling with precise character-level splitting
  • Chunking results are cached for repeated access
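One way to read the soft and hard limits is sketched below in Python. This is an interpretation of the description above, not Vespa's chunker: the function name `chunk_text` is hypothetical, and the sketch omits CJK handling, caching, and the dynamic adjustment of the target length.

```python
def chunk_text(text: str, target: int = 1000) -> list[str]:
    """Split text into chunks of roughly `target` characters (target >= 2)."""
    soft = target + target * 5 // 100   # 5% above target
    hard = target + target * 10 // 100  # 10% above target
    chunks: list[str] = []
    start = 0
    while len(text) - start > hard:
        cut = start + hard  # forced cut if no boundary is found in the window
        for i in range(start + target, start + hard + 1):
            # A cut at i ends the chunk with text[i - 1].
            single = not text[i - 1].isalnum()
            double = single and not text[i - 2].isalnum()
            # Up to the soft limit only a double boundary is accepted;
            # past it, a single boundary is enough.
            if double or (single and i > start + soft):
                cut = i
                break
        chunks.append(text[start:cut])
        start = cut
    if start < len(text):
        chunks.append(text[start:])
    return chunks
```

The chunks are contiguous slices, so joining them reproduces the input exactly, and no chunk exceeds the hard limit.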

Step 3: Unicode normalization

Apply NFKC (Normalization Form Compatibility Composition) Unicode normalization to standardize text representation. NFKC replaces compatibility characters, such as ligatures and full-width forms, with their standard equivalents and then recomposes combining sequences, so that tokens match consistently regardless of how the input represented them.

Key considerations:

  • NFKC handles ligature decomposition and width normalization
  • CJK compatibility ideographs are normalized to unified forms
  • This step runs before tokenization to ensure consistent splitting
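The effect of NFKC can be seen with Python's standard `unicodedata` module. This is only an illustration of the normalization form itself; the helper name `normalize_nfkc` is hypothetical and says nothing about how Vespa invokes normalization internally.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# Ligature decomposition: U+FB01 (LATIN SMALL LIGATURE FI) becomes "fi".
ligature = normalize_nfkc("\ufb01le")

# Width normalization: full-width Latin letters become ASCII letters.
fullwidth = normalize_nfkc("\uff21\uff22\uff23")

# Composition: "e" followed by a combining acute accent becomes U+00E9.
composed = normalize_nfkc("e\u0301")

# CJK compatibility ideograph U+F900 maps to the unified ideograph U+8C48.
unified = normalize_nfkc("\uf900")
```

Running normalization first means the tokenizer sees `file` rather than a ligature it might classify differently.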

Step 4: Tokenization

Split the normalized text into individual tokens at whitespace and character-type boundaries. The tokenizer tracks each token's type (alphabetic, numeric, symbol, punctuation) and script (Latin, CJK, Hangul). Indexable symbols are always separated into individual tokens. Transitions between indexable and non-indexable characters define token boundaries.

Key considerations:

  • Each token carries its original text, offset position, type, and script
  • Special tokens (configured via SpecialTokenRegistry) are recognized as single tokens
  • Character classification uses Unicode-aware rules beyond basic Java categories
  • The tokenizer is NOT thread-safe and must be instantiated per thread
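The boundary rule can be sketched in Python as follows. This is a simplified illustration, not Vespa's tokenizer: the `Token` class and `tokenize` function are hypothetical, "indexable" is approximated by `str.isalnum`, and script tracking, symbol tokens, and special tokens are left out.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Token:
    text: str    # original text of the token
    offset: int  # position of the first character in the input
    type: str    # ALPHABETIC, NUMERIC, SPACE, or PUNCTUATION


def tokenize(text: str) -> list[Token]:
    """Split text wherever indexable and non-indexable characters meet."""
    tokens: list[Token] = []
    offset = 0
    # groupby yields maximal runs of characters with the same indexable flag.
    for indexable, group in groupby(text, key=str.isalnum):
        run = "".join(group)
        if indexable:
            kind = "NUMERIC" if run.isdigit() else "ALPHABETIC"
        elif run.isspace():
            kind = "SPACE"
        else:
            kind = "PUNCTUATION"
        tokens.append(Token(run, offset, kind))
        offset += len(run)
    return tokens
```

Because each token keeps its offset, `text[t.offset:t.offset + len(t.text)]` recovers the original span, which is what makes highlighting and position-aware matching possible.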

Step 5: Text transformation

Apply optional transformations to each token based on the configured linguistics parameters. This includes lowercasing (using Vespa's locale-aware case conversion) and accent/diacritic removal (stripping combining diacritical marks after NFD decomposition). These transformations improve recall by matching variant forms of the same word.

Key considerations:

  • Lowercasing is locale-sensitive so that cases such as the Turkish dotted and dotless I are handled correctly
  • Accent removal uses Unicode decomposition followed by stripping Mark characters
  • Both transformations are optional and controlled by LinguisticsParameters
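Accent removal by decomposition can be sketched with Python's `unicodedata` module. The function names and the two boolean flags are hypothetical stand-ins for the options the text attributes to LinguisticsParameters. Note that Python's `str.lower` is not locale-aware, so this sketch does not reproduce the Turkish-specific behavior described above.

```python
import unicodedata


def strip_accents(token: str) -> str:
    """Decompose to NFD, then drop every character in a Mark category."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def transform(token: str, lowercase: bool = True, remove_accents: bool = True) -> str:
    """Apply the optional per-token transformations."""
    if lowercase:
        token = token.lower()  # locale-independent, unlike the behavior described above
    if remove_accents:
        token = strip_accents(token)
    return token
```

With both options enabled, `Café` and `cafe` map to the same token, which is the recall improvement the step aims for.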

Step 6: Stemming

Reduce inflected words to their root form using the KStem (Krovetz) algorithm. The stemmer handles regular English morphology (plurals, verb tenses, derivational suffixes) and maintains exception lists for irregular forms. Direct conflation rules handle common mappings like "aging" to "age" and "lying" to "lie".

Key considerations:

  • KStem is conservative: it only stems when confident the result is a valid word
  • Maximum word length is 50 characters; longer words are returned unchanged
  • The stemmer uses a dictionary of English words organized across 8 data files
  • For non-English text, stemming may be skipped depending on language support
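The conservative, dictionary-checked behavior can be illustrated with a toy Python stemmer. This is not the KStem algorithm: the lexicon, suffix list, and function name are made up for the example, and only the length guard and the two direct conflations come from the description above.

```python
MAX_WORD_LENGTH = 50

# Toy stand-ins for the real dictionary and conflation tables (assumptions).
LEXICON = {"age", "lie", "cat", "walk", "index", "bus"}
DIRECT_CONFLATIONS = {"aging": "age", "lying": "lie"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Return a stem only when the result is a known word."""
    if len(word) > MAX_WORD_LENGTH:
        return word  # overly long words are returned unchanged
    if word in DIRECT_CONFLATIONS:
        return DIRECT_CONFLATIONS[word]
    if word in LEXICON:
        return word  # already a valid root form
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in LEXICON:
                return candidate
    return word  # no confident stem found: leave the word alone
```

The final fallback is what makes the approach conservative: an unknown word such as `xyzzys` is left intact rather than being stripped to a non-word.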

Step 7: Embedding generation

Optionally convert processed text into dense vector representations (tensors) for semantic search. The Embedder interface accepts text and produces either token IDs or full tensor embeddings suitable for approximate nearest neighbor search. Multiple embedder implementations can coexist, selected by embedder ID.

Key considerations:

  • Embedders report latency, sequence length, and request count metrics
  • The embedding context carries document type, field name, and language information
  • Batch embedding of multiple text strings is supported for efficiency
  • Token decoding (tensor back to text) is available for debugging
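The shape of such an interface can be sketched in Python with a toy embedder. Everything here is hypothetical: `EmbedContext`, `HashingEmbedder`, the method names, and the `toy-hash` ID are illustrative and do not mirror Vespa's Embedder API, and the hashed bag-of-words vector is a stand-in for a real model. Metrics and token decoding are not shown.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedContext:
    """Hypothetical context object carrying the metadata listed above."""
    document_type: str
    field_name: str
    language: str


class HashingEmbedder:
    """Toy embedder: hashes whitespace tokens into a small dense vector."""

    def __init__(self, dimensions: int = 8, vocab_size: int = 1000) -> None:
        self.dimensions = dimensions
        self.vocab_size = vocab_size

    def embed_tokens(self, text: str, context: EmbedContext) -> list[int]:
        """Return one deterministic token ID per whitespace-separated token."""
        ids = []
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            ids.append(int.from_bytes(digest[:4], "big") % self.vocab_size)
        return ids

    def embed(self, text: str, context: EmbedContext) -> list[float]:
        """Return an L2-normalized vector (all zeros for empty input)."""
        vector = [0.0] * self.dimensions
        for token_id in self.embed_tokens(text, context):
            vector[token_id % self.dimensions] += 1.0
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def embed_batch(self, texts: list[str], context: EmbedContext) -> list[list[float]]:
        return [self.embed(text, context) for text in texts]


# Several embedders can coexist and be selected by ID.
EMBEDDERS = {"toy-hash": HashingEmbedder()}
```

A caller would look up an embedder by ID and pass the context along with the text, for example `EMBEDDERS["toy-hash"].embed("hello world", EmbedContext("article", "body", "en"))`.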
