
Principle:Vespa engine Vespa Indexing Expression Execution



Knowledge Sources
Domains Document_Processing, Indexing
Last Updated 2026-02-09 00:00 GMT

Overview

Indexing expression execution is the process of applying schema-defined field transformation rules to a document, converting raw input fields into indexed, attributed, and summarized representations suitable for search and retrieval.

Description

A search engine's schema defines not just the structure of documents but also how each field should be processed for indexing. These processing rules are expressed as indexing expressions -- a domain-specific language that specifies transformations such as tokenization, normalization, linguistic analysis, embedding generation, and attribute population.

At indexing time, these expressions are compiled into an executable form (a script expression tree) and applied to each incoming document. The execution process involves three phases:
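
The compile-once, apply-per-document idea can be illustrated with a minimal sketch. It assumes a toy pipe-separated mini-language and a hypothetical `STEPS` registry; neither is Vespa's actual grammar or operation set.

```python
from typing import Callable, Dict, List

# Hypothetical step registry; the names are illustrative only.
STEPS: Dict[str, Callable] = {
    "trim": str.strip,
    "lowercase": str.lower,
    "split": str.split,
}

def compile_expression(source: str) -> List[Callable]:
    """Parse 'a | b | c' once into a list of callables (the executable form)."""
    names = [part.strip() for part in source.split("|")]
    unknown = [name for name in names if name not in STEPS]
    if unknown:
        raise ValueError(f"unknown step(s): {unknown}")
    return [STEPS[name] for name in names]

def run(compiled: List[Callable], value):
    """Apply the compiled steps in order to a single field value."""
    for step in compiled:
        value = step(value)
    return value

compiled = compile_expression("trim | lowercase | split")
print(run(compiled, "  Hello Indexing World "))  # ['hello', 'indexing', 'world']
```

Parsing happens once in `compile_expression`, so unknown steps are reported before any document is processed, and `run` only has to walk the prepared list.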

Phase 1: Field Validation

Before any transformation is applied, each field in the document is validated against the document type definition. This ensures that the document does not contain undeclared fields that could cause unexpected behavior during expression execution. Fields that are not declared in the schema are rejected with an error.
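
A minimal sketch of this check, assuming the document is a plain mapping from field names to values and the document type is reduced to a set of declared names (`validate_fields` and `UndeclaredFieldError` are hypothetical names, not part of any real API):

```python
from typing import Any, Dict, Set

class UndeclaredFieldError(ValueError):
    """Raised when a document carries a field the schema does not declare."""

def validate_fields(document: Dict[str, Any], declared: Set[str]) -> None:
    # Collect every offending field so the error reports all of them at once.
    undeclared = sorted(set(document) - declared)
    if undeclared:
        raise UndeclaredFieldError(f"fields not declared in schema: {undeclared}")

validate_fields({"title": "Hello"}, declared={"title", "body"})  # passes silently
```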

Phase 2: Annotation Cleanup

Any existing linguistic annotations (span trees) from previous indexing passes are removed. This is essential for re-indexing scenarios where a document has already been processed and carries stale annotations. Without this cleanup, the new linguistic analysis would conflict with the old annotations.
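
The cleanup can be pictured as a recursive walk over field values. The sketch below assumes a hypothetical `StringValue` type that carries its span trees in a dictionary; the real data model may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class StringValue:
    """Hypothetical string field value that may carry linguistic annotations."""
    text: str
    span_trees: Dict[str, Any] = field(default_factory=dict)

def remove_stale_annotations(value: Any) -> None:
    """Drop span trees from a value, descending into maps and sequences."""
    if isinstance(value, StringValue):
        value.span_trees.clear()
    elif isinstance(value, dict):
        for nested in value.values():
            remove_stale_annotations(nested)
    elif isinstance(value, (list, tuple)):
        for nested in value:
            remove_stale_annotations(nested)

doc = {"title": StringValue("Hello", {"linguistics": ["old tokens"]})}
remove_stale_annotations(doc)
print(doc["title"].span_trees)  # {}
```

Values that carry no annotations (numbers, plain strings) fall through all branches and are left untouched.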

Phase 3: Expression Execution

The compiled expression tree is executed against the document. The expression tree is a directed acyclic graph of operations that read input fields, apply transformations, and write results to output fields. Common operations include:

  • Tokenization: Breaking text fields into individual tokens for full-text search.
  • Normalization: Applying case folding, accent removal, and other text normalization.
  • Stemming: Reducing tokens to their root forms, using language-specific rules, so that inflected variants of a word match each other.
  • Attribute population: Copying field values into in-memory attribute stores for filtering and sorting.
  • Summary generation: Preparing field values for inclusion in search result summaries.
  • Embedding generation: Computing vector embeddings for semantic search and nearest-neighbor retrieval.
  • Chunking: Splitting long text fields into smaller segments for embedding or other processing.
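
Three of these operations (normalization, tokenization, and chunking) are simple enough to sketch with the Python standard library. This is an illustrative approximation with a naive word-based tokenizer and fixed-size chunks, not the linguistic processing a production engine would use.

```python
import re
import unicodedata
from typing import List

def normalize(text: str) -> str:
    """Case-fold, then strip accents by dropping combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text: str) -> List[str]:
    """Naive tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)

def chunk(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = tokenize(normalize("Crème Brûlée, twice!"))
print(tokens)            # ['creme', 'brulee', 'twice']
print(chunk(tokens, 2))  # [['creme', 'brulee'], ['twice']]
```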

Expression execution is deadline-aware: if a processing deadline is provided, long-running operations (such as embedding generation) can check it and abort with a timeout error instead of blocking indefinitely.
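
One way to picture a deadline check is shown below. The sketch assumes the deadline is an absolute `time.monotonic()` timestamp and that the embedding function is supplied by the caller; `check_deadline`, `embed_all`, and `DeadlineExceeded` are hypothetical names.

```python
import time
from typing import Callable, List, Optional, Sequence

class DeadlineExceeded(TimeoutError):
    """Raised when work is attempted after the processing deadline."""

def check_deadline(deadline: Optional[float]) -> None:
    # No deadline means "never time out".
    if deadline is not None and time.monotonic() >= deadline:
        raise DeadlineExceeded("indexing deadline passed")

def embed_all(
    chunks: Sequence[str],
    embed: Callable[[str], List[float]],
    deadline: Optional[float] = None,
) -> List[List[float]]:
    """Embed each chunk, checking the deadline before every expensive call."""
    vectors = []
    for text in chunks:
        check_deadline(deadline)
        vectors.append(embed(text))
    return vectors

# Stand-in embedder for demonstration: a one-dimensional "vector" of the length.
fake_embed = lambda text: [float(len(text))]
print(embed_all(["ab", "cde"], fake_embed, deadline=time.monotonic() + 5.0))
```

Checking between chunks bounds the overrun to a single embedding call; a check cannot interrupt a call that is already in progress.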

The execution may also differ based on whether the document is being indexed for the first time or being re-indexed. The isReindexing flag allows expressions to optimize their behavior -- for example, skipping certain validations or applying incremental updates rather than full recomputation.

Usage

Indexing expression execution is the core transformation step in the document indexing pipeline. It is applied after document reception and script resolution, and before the document is forwarded to the content layer for storage.

Use this pattern when:

  • You are building a schema-driven document processing system where field transformations are defined declaratively.
  • You need to support a rich set of field-level operations including linguistic analysis, embedding generation, and attribute population.
  • You want transformation logic to be configurable through schema definitions rather than hardcoded in the processing pipeline.
  • You need deadline-aware processing to prevent unbounded execution time.

Theoretical Basis

Indexing expression execution follows the interpreter pattern applied to a domain-specific language. The indexing expressions defined in the schema are parsed into an abstract syntax tree (AST) and then executed by an interpreter that traverses the tree.

The execution model can be described as:

function executeIndexing(document, isReindexing, deadline):
    // Phase 1: Validate all fields
    for each (field, value) in document:
        if field not declared in documentType:
            raise InvalidFieldError(field)

    // Phase 2: Clean stale annotations
    for each (field, value) in document:
        removeStaleAnnotations(value)

    // Phase 3: Execute expression tree
    context = createExecutionContext(document, isReindexing, deadline)
    expressionTree.execute(context)

    return context.outputDocument
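
For readers who prefer something executable, the pseudocode above could be rendered in Python roughly as follows. This is a sketch under simplifying assumptions: field values are dictionaries with a `"text"` key and an optional `"annotations"` key, the compiled expression is any callable that takes the context, and all names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Set

@dataclass
class ExecutionContext:
    document: Dict[str, Any]
    is_reindexing: bool
    deadline: Optional[float]
    output: Dict[str, Any] = field(default_factory=dict)

def execute_indexing(
    document: Dict[str, Any],
    declared_fields: Set[str],
    expression: Callable[[ExecutionContext], None],
    is_reindexing: bool = False,
    deadline: Optional[float] = None,
) -> Dict[str, Any]:
    # Phase 1: reject fields the document type does not declare.
    for name in document:
        if name not in declared_fields:
            raise ValueError(f"undeclared field: {name}")
    # Phase 2: drop annotations left over from an earlier indexing pass.
    for value in document.values():
        if isinstance(value, dict):
            value.pop("annotations", None)
    # Phase 3: run the compiled expression against a fresh context.
    context = ExecutionContext(document, is_reindexing, deadline)
    expression(context)
    return context.output

def title_expression(ctx: ExecutionContext) -> None:
    """Example expression: lowercase and split the title into tokens."""
    if ctx.deadline is not None and time.monotonic() >= ctx.deadline:
        raise TimeoutError("indexing deadline passed")
    ctx.output["title_tokens"] = ctx.document["title"]["text"].lower().split()

doc = {"title": {"text": "Hello World", "annotations": ["stale"]}}
print(execute_indexing(doc, {"title"}, title_expression))
# {'title_tokens': ['hello', 'world']}
```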

The expression tree is a composition of primitive operations:

Operation Type | Input | Output | Description
--- | --- | --- | ---
Input | Field name | Field value | Reads a field value from the document
Tokenize | String value | Token list | Breaks text into searchable tokens
Normalize | String value | String value | Applies text normalization rules
Stem | Token list | Token list | Reduces tokens to root forms
Embed | Text value | Vector value | Generates dense vector embeddings
Attribute | Field value | Attribute store | Writes value to in-memory attribute
Summary | Field value | Summary store | Writes value to summary field
Output | Field value | Document field | Writes a value to the output document

These primitives are composed using sequential execution (pipeline), conditional branching, and parallel fan-out to form complex transformation graphs that match the schema's indexing declarations.
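
The three composition styles can be sketched as small higher-order functions. The combinator names (`sequence`, `when`, `fan_out`) are hypothetical, and the fan-out here evaluates its branches one after another; a real engine may or may not run them concurrently.

```python
from typing import Any, Callable, List

Op = Callable[[Any], Any]

def sequence(*ops: Op) -> Op:
    """Pipeline: feed the output of each operation into the next."""
    def run(value: Any) -> Any:
        for op in ops:
            value = op(value)
        return value
    return run

def when(predicate: Callable[[Any], bool], then: Op, otherwise: Op = lambda v: v) -> Op:
    """Conditional branch: pick an operation based on the current value."""
    return lambda value: then(value) if predicate(value) else otherwise(value)

def fan_out(*ops: Op) -> Callable[[Any], List[Any]]:
    """Fan-out: apply every operation to the same input, collecting all results."""
    return lambda value: [op(value) for op in ops]

pipeline = sequence(
    str.strip,
    when(lambda s: len(s) > 0, str.lower),
    fan_out(str.split, len),
)
print(pipeline("  Hello World "))  # [['hello', 'world'], 11]
```

Because each combinator returns an ordinary callable, combinators nest freely, which mirrors how a schema's indexing declaration can mix pipelines, conditions, and multiple outputs.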

