Principle:Vespa engine Vespa Indexing Expression Execution
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, Indexing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Indexing expression execution is the process of applying schema-defined field transformation rules to a document, converting raw input fields into indexed, attributed, and summarized representations suitable for search and retrieval.
Description
A search engine's schema defines not just the structure of documents but also how each field should be processed for indexing. These processing rules are expressed as indexing expressions -- a domain-specific language that specifies transformations such as tokenization, normalization, linguistic analysis, embedding generation, and attribute population.
At indexing time, these expressions are compiled into an executable form (a script expression tree) and applied to each incoming document. The execution process involves three phases:
Phase 1: Field Validation
Before any transformation is applied, each field in the document is validated against the document type definition. This ensures that the document does not contain undeclared fields that could cause unexpected behavior during expression execution. Fields that are not declared in the schema are rejected with an error.
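The validation phase can be sketched in a few lines. This is a minimal illustration, assuming the document is a plain dictionary and the document type is a set of declared field names; `validate_fields` and `InvalidFieldError` are hypothetical names, not part of Vespa's API.

```python
class InvalidFieldError(Exception):
    """Raised when a document carries a field the schema does not declare."""


def validate_fields(document: dict, declared_fields: set) -> None:
    # Reject the whole document on the first undeclared field.
    for field_name in document:
        if field_name not in declared_fields:
            raise InvalidFieldError(
                f"field '{field_name}' is not declared in the document type"
            )


declared = {"title", "body"}

# A document that only uses declared fields passes silently.
validate_fields({"title": "ok"}, declared)

# A document with an undeclared field is rejected.
try:
    validate_fields({"title": "ok", "colour": "red"}, declared)
    rejected = False
except InvalidFieldError:
    rejected = True
```

Failing fast here keeps later phases simple: expression execution can assume every field it encounters has a known type.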
Phase 2: Annotation Cleanup
Any existing linguistic annotations (span trees) from previous indexing passes are removed. This is essential for re-indexing scenarios where a document has already been processed and carries stale annotations. Without this cleanup, the new linguistic analysis would conflict with the old annotations.
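As a rough sketch of the cleanup step, assume each text value carries its annotations in a dictionary of named span trees. The `TextValue` class and `remove_stale_annotations` function below are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextValue:
    """Hypothetical text field value that can carry linguistic annotations."""
    text: str
    span_trees: dict = field(default_factory=dict)


def remove_stale_annotations(document: dict) -> int:
    """Drop all span trees from text values; return how many were removed."""
    removed = 0
    for value in document.values():
        if isinstance(value, TextValue) and value.span_trees:
            removed += len(value.span_trees)
            value.span_trees.clear()
    return removed


# A previously indexed document: the title still carries an old span tree.
doc = {"title": TextValue("Hello", {"linguistics": ["hello"]}), "year": 2024}
removed_count = remove_stale_annotations(doc)
```

Only the annotations are discarded; the raw text is left intact so that the subsequent linguistic analysis starts from a clean value.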
Phase 3: Expression Execution
The compiled expression tree is executed against the document. The expression tree is a directed acyclic graph of operations that read input fields, apply transformations, and write results to output fields. Common operations include:
- Tokenization: Breaking text fields into individual tokens for full-text search.
- Normalization: Applying case folding, accent removal, and other text normalization.
- Stemming: Reducing tokens to their root forms so that inflected variants of a word (for example, "running" and "runs") match the same term. Stemming rules are language-specific.
- Attribute population: Copying field values into in-memory attribute stores for filtering and sorting.
- Summary generation: Preparing field values for inclusion in search result summaries.
- Embedding generation: Computing vector embeddings for semantic search and nearest-neighbor retrieval.
- Chunking: Splitting long text fields into smaller segments for embedding or other processing.
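A few of the operations listed above can be approximated with the Python standard library. The sketch below is deliberately simplistic: it assumes whitespace-and-punctuation tokenization and fixed-size chunks, and the function names are illustrative only. Real linguistic processing is considerably more involved.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Case-fold and strip accents by removing Unicode combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list:
    """Split text into word tokens, discarding punctuation."""
    return re.findall(r"\w+", text)


def chunk(tokens: list, size: int) -> list:
    """Group tokens into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


tokens = tokenize(normalize("Café Déjà Vu, again!"))
chunks = chunk(tokens, 2)
# tokens -> ['cafe', 'deja', 'vu', 'again']
# chunks -> [['cafe', 'deja'], ['vu', 'again']]
```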
Expression execution is deadline-aware: if a processing deadline is provided, long-running operations (such as embedding generation) can check it and abort with a timeout error rather than blocking indefinitely.
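One way to picture deadline checking is a loop that consults a monotonic clock before each expensive step. The sketch below assumes the deadline is an absolute `time.monotonic()` timestamp; `embed_all` and `DeadlineExceeded` are hypothetical names, and the "embedding" is a placeholder rather than a real model call.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when processing runs past the supplied deadline."""


def embed_all(texts, deadline=None, clock=time.monotonic):
    vectors = []
    for text in texts:
        # Check before each expensive step so the abort happens promptly.
        if deadline is not None and clock() >= deadline:
            raise DeadlineExceeded(f"timed out after {len(vectors)} embeddings")
        # Placeholder vector; a real embedder would call a model here.
        vectors.append([float(len(text))])
    return vectors


# A generous deadline lets all work complete.
vectors = embed_all(["a", "bb"], deadline=time.monotonic() + 60.0)

# A deadline that has already passed aborts before any work is done.
try:
    embed_all(["a", "bb"], deadline=time.monotonic() - 1.0)
    timed_out = False
except DeadlineExceeded:
    timed_out = True
```

Using a monotonic clock rather than wall-clock time avoids spurious timeouts when the system clock is adjusted.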
The execution may also differ based on whether the document is being indexed for the first time or being re-indexed. The isReindexing flag allows expressions to optimize their behavior -- for example, skipping certain validations or applying incremental updates rather than full recomputation.
Usage
Indexing expression execution is the core transformation step in the document indexing pipeline. It is applied after document reception and script resolution, and before the document is forwarded to the content layer for storage.
Use this pattern when:
- You are building a schema-driven document processing system where field transformations are defined declaratively.
- You need to support a rich set of field-level operations including linguistic analysis, embedding generation, and attribute population.
- You want transformation logic to be configurable through schema definitions rather than hardcoded in the processing pipeline.
- You need deadline-aware processing to prevent unbounded execution time.
Theoretical Basis
Indexing expression execution follows the interpreter pattern applied to a domain-specific language. The indexing expressions defined in the schema are parsed into an abstract syntax tree (AST) and then executed by an interpreter that traverses the tree.
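To make the interpreter pattern concrete, here is a small sketch that parses a pipe-separated expression into a script object and then executes it against a document. The expression syntax, the `STAGES` registry, and the `Script` class are simplified assumptions for illustration; they are not Vespa's actual indexing language grammar or classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry mapping stage names to transformation functions.
STAGES: Dict[str, Callable[[object], object]] = {
    "lowercase": lambda v: v.lower(),
    "tokenize": lambda v: v.split(),
}


@dataclass
class Script:
    """Parsed form of an expression: an input field plus ordered stages."""
    input_field: str
    stages: List[str]

    def execute(self, document: dict) -> object:
        value = document[self.input_field]
        for name in self.stages:
            value = STAGES[name](value)
        return value


def compile_script(source: str) -> Script:
    """Parse 'input <field> | stage | stage ...' into a Script."""
    parts = [p.strip() for p in source.split("|")]
    head = parts[0].split()
    if len(head) != 2 or head[0] != "input":
        raise ValueError(f"expected 'input <field>', got {parts[0]!r}")
    for name in parts[1:]:
        if name not in STAGES:
            raise ValueError(f"unknown stage {name!r}")
    return Script(input_field=head[1], stages=parts[1:])


script = compile_script("input title | lowercase | tokenize")
result = script.execute({"title": "Hello Vespa World"})
# result -> ['hello', 'vespa', 'world']
```

Parsing once and executing many times is the main benefit: syntax errors surface when the schema is loaded, not when a document arrives.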
The execution model can be described as:
    function executeIndexing(document, isReindexing, deadline):
        // Phase 1: Validate all fields
        for each (field, value) in document:
            if field not declared in documentType:
                raise InvalidFieldError(field)

        // Phase 2: Clean stale annotations
        for each (field, value) in document:
            removeStaleAnnotations(value)

        // Phase 3: Execute expression tree
        context = createExecutionContext(document, isReindexing, deadline)
        expressionTree.execute(context)
        return context.outputDocument
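The same three-phase model can be written as runnable Python. This is a sketch under simplifying assumptions: documents are dictionaries, stale annotations live in a separate dictionary keyed by field name, and the expression tree is reduced to a single callable. All names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class InvalidFieldError(Exception):
    pass


@dataclass
class ExecutionContext:
    document: dict
    is_reindexing: bool
    deadline: Optional[float]
    output: dict = field(default_factory=dict)


def execute_indexing(
    document: dict,
    declared_fields: set,
    annotations: dict,
    expression: Callable[[ExecutionContext], None],
    is_reindexing: bool = False,
    deadline: Optional[float] = None,
) -> dict:
    # Phase 1: reject documents that contain undeclared fields.
    for name in document:
        if name not in declared_fields:
            raise InvalidFieldError(name)

    # Phase 2: drop annotations left over from an earlier indexing pass.
    for name in document:
        annotations.pop(name, None)

    # Phase 3: run the expression against a fresh execution context.
    context = ExecutionContext(document, is_reindexing, deadline)
    expression(context)
    return context.output


def title_expression(ctx: ExecutionContext) -> None:
    # Stand-in for a compiled expression tree: copy and tokenize the title.
    ctx.output["title"] = ctx.document["title"]
    ctx.output["title_tokens"] = ctx.document["title"].lower().split()


stale = {"title": ["old-span-tree"]}
out = execute_indexing({"title": "Hello World"}, {"title"}, stale, title_expression)
```

Passing the reindexing flag and deadline through a context object, rather than as extra arguments to every operation, keeps individual operations free to ignore what they do not need.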
The expression tree is a composition of primitive operations:
| Operation Type | Input | Output | Description |
|---|---|---|---|
| Input | Field name | Field value | Reads a field value from the document |
| Tokenize | String value | Token list | Breaks text into searchable tokens |
| Normalize | String value | String value | Applies text normalization rules |
| Stem | Token list | Token list | Reduces tokens to root forms |
| Embed | Text value | Vector value | Generates dense vector embeddings |
| Attribute | Field value | Attribute store | Writes value to in-memory attribute |
| Summary | Field value | Summary store | Writes value to summary field |
| Output | Field value | Document field | Writes a value to the output document |
These primitives are composed using sequential execution (pipeline), conditional branching, and parallel fan-out to form complex transformation graphs that match the schema's indexing declarations.
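The three composition styles can be illustrated with small higher-order functions. The combinator names `sequence`, `branch`, and `fan_out` are assumptions made for this sketch, and the fan-out here runs its branches one after another; "parallel" refers to the shape of the dataflow, not to concurrent execution.

```python
from typing import Callable

Op = Callable[[object], object]


def sequence(*ops: Op) -> Op:
    """Pipeline: feed each operation's output into the next."""
    def run(value):
        for op in ops:
            value = op(value)
        return value
    return run


def branch(predicate: Callable[[object], bool], if_true: Op, if_false: Op) -> Op:
    """Conditional: choose one of two operations based on the value."""
    return lambda value: if_true(value) if predicate(value) else if_false(value)


def fan_out(*ops: Op) -> Op:
    """Fan-out: apply every operation to the same input, collect all results."""
    return lambda value: [op(value) for op in ops]


pipeline = sequence(
    str.strip,
    branch(lambda s: len(s) > 5, lambda s: s[:5], lambda s: s),
    fan_out(str.lower, str.upper, len),
)
result = pipeline("  Indexing  ")
# result -> ['index', 'INDEX', 5]
```

Because every combinator returns another operation of the same shape, combinators nest freely, which is what allows a handful of primitives to express arbitrary transformation graphs.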