Principle:Vespa engine Vespa Linguistics Span Tree Cleanup
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, Indexing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Linguistics span tree cleanup is the process of removing stale linguistic annotations from document field values before re-indexing, ensuring that fresh processing is not contaminated by outdated analysis results.
Description
When a document is processed through an indexing pipeline, linguistic analysis produces annotations that are attached to text fields as span trees. These annotations may include tokenization boundaries, stemming results, normalization forms, and language detection metadata. The annotations are stored alongside the field values as structured overlays called span trees.
When a document is re-indexed (for example, after a schema change or a re-processing trigger), the existing linguistic span trees must be removed before new analysis is applied. If stale annotations are left in place, the new linguistic processing may produce conflicting or duplicated results, leading to incorrect search behavior.
The cleanup process must handle the full range of field value types that can contain text:
- String fields: The simplest case. The linguistics span tree is directly attached to the string field value and must be removed.
- Array fields: Each element of the array may be a string (or a nested complex type containing strings). The cleanup must iterate over all elements recursively.
- Weighted set fields: The keys of the weighted set may be string field values with attached span trees. Each key must be cleaned.
- Map fields: Both keys and values in a map may contain string fields with linguistics annotations. Both must be traversed.
- Structured fields (structs): A struct may contain any combination of the above field types. The cleanup must iterate over all fields of the struct and recurse into each value.
This recursive traversal ensures that no matter how deeply nested a text field is within a complex document structure, its stale linguistics annotations are removed before re-processing.
Usage
Apply linguistics span tree cleanup immediately before executing indexing expressions on a document. It is a required pre-processing step for any re-indexing operation.
Use this pattern when:
- You are re-indexing documents that may already carry linguistic annotations from a previous indexing pass.
- You are changing the linguistic processing configuration (such as switching stemming algorithms or language settings) and need to ensure old annotations do not persist.
- You are implementing a document processing pipeline that applies linguistic analysis and needs to guarantee idempotent behavior on repeated application.
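As a rough illustration of the idempotency concern, the sketch below uses a toy string value and a hypothetical annotator that appends token spans to whatever linguistics tree is already present. The names `StringValue`, `annotate`, and `reindex` are assumptions made for this example and are not Vespa APIs; the point is only that cleaning before annotating makes repeated passes yield the same result.

```python
from typing import Dict, List, Tuple

LINGUISTICS = "linguistics"  # span tree name, as used in the pseudocode in this article

Span = Tuple[int, int, str]  # (start offset, end offset, normalized token)


class StringValue:
    """Toy stand-in for a string field value carrying named span trees."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.span_trees: Dict[str, List[Span]] = {}


def annotate(value: StringValue) -> None:
    """Hypothetical annotator: appends one span per whitespace-separated word."""
    tree = value.span_trees.setdefault(LINGUISTICS, [])
    pos = 0
    for word in value.text.split():
        start = value.text.index(word, pos)
        end = start + len(word)
        tree.append((start, end, word.lower()))
        pos = end


def reindex(value: StringValue, cleanup: bool = True) -> List[Span]:
    """Optionally drop the stale linguistics tree, then annotate again."""
    if cleanup:
        value.span_trees.pop(LINGUISTICS, None)  # no-op if the tree is absent
    annotate(value)
    return list(value.span_trees[LINGUISTICS])


# With cleanup, a second pass reproduces the first.
clean = StringValue("Hello World")
first = reindex(clean)
second = reindex(clean)

# Without cleanup, this toy annotator piles new spans on top of stale ones.
dirty = StringValue("Hello World")
reindex(dirty, cleanup=False)
doubled = reindex(dirty, cleanup=False)
```

In this toy setup `first` and `second` are identical, while `doubled` holds every span twice. How a real annotator reacts to a stale tree may differ; the duplication here is a property of the hypothetical `annotate` function.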
Theoretical Basis
The cleanup algorithm is a recursive, type-dispatched traversal over the field value type hierarchy. It plays the role of a visitor without requiring a formal visitor interface: at each node it inspects the runtime type of the field value and dispatches to the matching cleanup logic.
The traversal can be described as follows:
    function removeSpanTrees(value):
        if value is StringFieldValue:
            value.removeSpanTree("linguistics")
        else if value is Array:
            for each element in value:
                removeSpanTrees(element)
        else if value is WeightedSet:
            for each key in value.keys:
                removeSpanTrees(key)
        else if value is Map:
            for each (key, val) in value.entries:
                removeSpanTrees(key)
                removeSpanTrees(val)
        else if value is Struct:
            for each (field, val) in value.fields:
                removeSpanTrees(val)
The recursion terminates because every field value is a finite tree: primitive values (strings, numbers) are leaves, while collection and structured values are internal nodes, so every path from the root eventually reaches a leaf. Leaves other than strings match none of the branches and are left untouched.
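The traversal above can be made runnable against a small toy model. In the sketch below, `StringValue`, `WeightedSet`, and `Struct` are hypothetical stand-ins defined only for this example, arrays and maps are modeled as plain Python lists and dicts, and the `"custom"` tree name is invented. None of this is the Vespa API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

LINGUISTICS = "linguistics"  # span tree name taken from the pseudocode


@dataclass(eq=False)  # eq=False keeps identity hashing, so values can be dict keys
class StringValue:
    text: str
    span_trees: Dict[str, Any] = field(default_factory=dict)

    def remove_span_tree(self, name: str) -> Any:
        # Returns the removed tree, or None if no tree had that name.
        return self.span_trees.pop(name, None)


@dataclass(eq=False)
class WeightedSet:
    weights: Dict[Any, int]  # key -> weight


@dataclass(eq=False)
class Struct:
    fields: Dict[str, Any]  # field name -> value


def remove_span_trees(value: Any, name: str = LINGUISTICS) -> None:
    """Recursively remove the named span tree from every string value."""
    if isinstance(value, StringValue):
        value.remove_span_tree(name)
    elif isinstance(value, list):  # array
        for element in value:
            remove_span_trees(element, name)
    elif isinstance(value, WeightedSet):  # only the keys can hold text
        for key in value.weights:
            remove_span_trees(key, name)
    elif isinstance(value, dict):  # map: clean both keys and values
        for key, val in value.items():
            remove_span_trees(key, name)
            remove_span_trees(val, name)
    elif isinstance(value, Struct):
        for val in value.fields.values():
            remove_span_trees(val, name)
    # Anything else (numbers, etc.) is a leaf with nothing to clean.


def annotated(text: str) -> StringValue:
    return StringValue(text, {LINGUISTICS: ["token spans"], "custom": ["kept"]})


title, tag, key, val = (annotated(t) for t in ("hello", "news", "k", "v"))
doc = Struct({
    "title": title,
    "tags": WeightedSet({tag: 3}),
    "attrs": {key: [val, 42]},  # map whose value is an array mixing types
})
remove_span_trees(doc)
```

After the call, none of the four string values carries a `"linguistics"` tree, while each still has its `"custom"` tree. The integer `42` falls through every branch unchanged.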
Idempotency: The cleanup operation is idempotent. Removing a span tree that does not exist is a no-op. This means the cleanup can be safely applied multiple times without side effects, which simplifies error recovery in the processing pipeline.
Selective removal: Only the linguistics-specific span tree is removed. Other span trees (if any) that may be attached to field values are preserved. This ensures that non-linguistic annotations survive the re-indexing process.
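Both properties can be checked on a single string value. The following is a minimal sketch using a toy `StringValue` class and an invented `"entities"` tree name, both assumptions for illustration; it assumes that removing a missing tree simply returns nothing, as described above.

```python
from typing import Any, Dict, Optional

LINGUISTICS = "linguistics"


class StringValue:
    """Toy string value holding span trees by name (not the Vespa class)."""

    def __init__(self, text: str, span_trees: Optional[Dict[str, Any]] = None) -> None:
        self.text = text
        self.span_trees: Dict[str, Any] = dict(span_trees or {})

    def remove_span_tree(self, name: str) -> Any:
        # Removing an absent tree is a no-op that returns None.
        return self.span_trees.pop(name, None)


value = StringValue("hello", {LINGUISTICS: ["tokens"], "entities": ["person"]})

removed_first = value.remove_span_tree(LINGUISTICS)   # removes the tree
after_first = dict(value.span_trees)

removed_second = value.remove_span_tree(LINGUISTICS)  # nothing left to remove
after_second = dict(value.span_trees)
```

The second call leaves the value in the same state as the first, and the `"entities"` tree is present after both.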