Heuristic:Vespa engine Vespa Document Batch Processing Strategy

From Leeroopedia



Knowledge Sources
Domains: Optimization, Document_Processing
Last Updated: 2026-02-09 00:00 GMT

Overview

Document processing strategy using atomic batch replacement, deadline propagation, three-tier error classification, and 64KB serialization buffers with 2x growth for type conversion.

Description

The Vespa IndexingProcessor implements a batch-first processing strategy: all document operations are processed into a separate list, and the original list is replaced only after every operation has been processed. An error mid-batch therefore leaves the original list untouched instead of partially rewritten. Deadlines are propagated through the entire pipeline so that downstream work stops once the request's time budget is spent. Errors are classified into three tiers (invalid input, overload, timeout) so that callers at higher levels can choose a retry strategy per tier. When document type conversion is needed, serialization and deserialization go through a buffer with a 64 KB initial capacity and a 2.0x growth factor.

Usage

Apply this heuristic when working with the IndexingProcessor implementation or designing similar batch document processing systems. The three-tier error classification pattern is particularly useful for any request processing pipeline that needs differentiated retry behavior.

The Insight (Rule of Thumb)

  • Action 1: Process all operations into a separate output list, then replace the contents of the original list in one final step, after the whole batch has been processed.
  • Action 2: Propagate deadlines through all downstream operations using proc.timeLeft().
  • Action 3: Classify errors into three tiers:
    • InvalidInputException -> Progress.INVALID_INPUT (client error, don't retry)
    • OverloadException -> Progress.OVERLOAD (system overload, back off)
    • TimeoutException -> Progress.TIMEOUT (timeout, retry later)
  • Action 4: For document type conversion, use 64KB initial buffer with 2.0x growth: new GrowableByteBuffer(64 * 1024, 2.0f).
  • Action 5: Remove linguistics span trees recursively from all nested collection types before re-indexing.
  • Trade-off: The batch-first approach temporarily holds a second list of the same length as the batch, in exchange for never exposing a half-processed batch.
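
The batch-first action can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the IndexingProcessor code: the class BatchFirst, the method processBatch, and the generic transform parameter are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Vespa: illustrates the batch-first pattern.
final class BatchFirst {

    /**
     * Applies transform to every element of ops. Results are staged in a
     * separate list, so ops keeps its original contents if transform throws.
     */
    static <T> void processBatch(List<T> ops, UnaryOperator<T> transform) {
        List<T> out = new ArrayList<>(ops.size());
        for (T op : ops) {
            out.add(transform.apply(op)); // may throw; ops is still intact here
        }
        // Reached only when every element was processed successfully.
        ops.clear();
        ops.addAll(out);
    }
}
```

If transform throws on the third of five operations, ops still holds all five original entries. The price is the staging list out, which is the extra temporary memory named in the trade-off.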

Reasoning

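Because results are staged in a separate list, a failure while processing any operation leaves the original list unmodified: the caller sees either the fully processed batch or the original one, never a mix of the two. This is critical for maintaining document consistency. The guarantee concerns failures, not concurrency, since the final clear() followed by addAll() is not atomic with respect to other threads reading the list. Deadline propagation bounds the time spent on the whole batch, so one slow document cannot hold the request past its time budget. The three-tier error classification allows the caller to make informed decisions: invalid input should not be retried, overload should trigger backoff, and timeout should trigger a later retry. The 64 KB initial capacity lets typical documents serialize without any reallocation, and doubling keeps the number of array copies logarithmic in document size for larger ones.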

The span tree removal is necessary because stale linguistic annotations from previous indexing would interfere with new annotations. The removal must be recursive because documents can have nested structures (arrays of structs containing string fields with span trees).

Code Evidence

Batch processing from IndexingProcessor.java:92-114:

public Progress process(Processing proc) {
    if (proc.getDocumentOperations().isEmpty()) return Progress.DONE;

    List<DocumentOperation> out = new ArrayList<>(proc.getDocumentOperations().size());
    for (var op : proc.getDocumentOperations()) {
        // Process each operation into output list
    }
    proc.getDocumentOperations().clear();
    proc.getDocumentOperations().addAll(out);  // Atomic replacement
    return Progress.DONE;
}

Deadline propagation from IndexingProcessor.java:96-100:

Instant deadline = null;
var timeLeft = proc.timeLeft();
if (timeLeft != Processing.NO_TIMEOUT) {
    deadline = Instant.now().plus(timeLeft);
}
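
The reason for converting the remaining time into an absolute Instant is that a relative timeout would silently restart at every stage it is handed to, while an absolute deadline shrinks as time passes. The sketch below shows that conversion and the reverse lookup a downstream stage might do. It assumes timeLeft is a java.time.Duration (which the Instant.plus call above suggests); the class Deadlines, both method names, and the noTimeout parameter standing in for the sentinel are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Vespa.
final class Deadlines {

    /** Returns the absolute deadline, or null when timeLeft equals the no-timeout sentinel. */
    static Instant toDeadline(Duration timeLeft, Duration noTimeout, Instant now) {
        return timeLeft.equals(noTimeout) ? null : now.plus(timeLeft);
    }

    /** Returns the time left until deadline, floored at zero; null means no deadline. */
    static Duration remaining(Instant deadline, Instant now) {
        if (deadline == null) return null;
        Duration left = Duration.between(now, deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }
}
```

Passing now as a parameter rather than calling Instant.now() inside the helper keeps the arithmetic testable with fixed timestamps.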

Three-tier error handling from IndexingProcessor.java:116-134:

} catch (InvalidInputException e) {
    return Progress.INVALID_INPUT.withReason(...);
} catch (OverloadException e) {
    return Progress.OVERLOAD.withReason(...);
} catch (TimeoutException e) {
    return Progress.TIMEOUT.withReason(...);
}
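
On the caller's side, the three tiers map naturally onto three retry decisions. The following is only a sketch of such a policy: the class RetryPolicy, the Outcome enum (which mirrors the three Progress values above rather than using Vespa's type), and the concrete delays of 100 ms and 10 s are all assumptions chosen for illustration.

```java
import java.time.Duration;

// Hypothetical caller-side policy, not part of Vespa. The enum mirrors the
// three Progress outcomes named above; the delay values are illustrative.
final class RetryPolicy {
    enum Outcome { INVALID_INPUT, OVERLOAD, TIMEOUT }

    /**
     * Returns the delay before the next attempt, or null if the request
     * should not be retried. attempt is 0 for the first retry.
     */
    static Duration nextDelay(Outcome outcome, int attempt) {
        switch (outcome) {
            case INVALID_INPUT:
                return null; // client error: retrying the same input cannot succeed
            case OVERLOAD: {
                // Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at 10 s.
                int shift = Math.min(Math.max(attempt, 0), 20);
                return Duration.ofMillis(Math.min(10_000L, 100L << shift));
            }
            case TIMEOUT:
                return Duration.ofMillis(100); // fixed short delay before retrying
            default:
                throw new IllegalArgumentException("Unknown outcome: " + outcome);
        }
    }
}
```

Keeping the classification in the processor and the policy in the caller means the delays can be tuned without touching the processing code.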

Buffer sizing from IndexingProcessor.java:158:

GrowableByteBuffer buffer = new GrowableByteBuffer(64 * 1024, 2.0f);
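
To see what the two constructor arguments buy, the capacity arithmetic can be modelled on its own. The helper below is hypothetical and does not use GrowableByteBuffer; it assumes the buffer grows by multiplying its capacity by the growth factor each time it runs out of room, and counts how many such reallocations a payload of a given size would trigger.

```java
// Hypothetical helper, not Vespa's GrowableByteBuffer: models only the
// capacity arithmetic, under the assumption of multiplicative growth.
final class BufferGrowth {

    /** Counts the reallocations needed before a buffer of initial bytes can hold needed bytes. */
    static int reallocations(long initial, double factor, long needed) {
        if (initial <= 0 || factor <= 1.0) {
            throw new IllegalArgumentException("initial must be > 0 and factor > 1");
        }
        long capacity = initial;
        int count = 0;
        while (capacity < needed) {
            capacity = (long) Math.ceil(capacity * factor); // one grow-and-copy step
            count++;
        }
        return count;
    }
}
```

Under this model a payload of up to 64 KB needs no reallocation, and a 1 MB payload needs four (64 KB to 128, 256, 512, then 1024 KB).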

Recursive span tree removal from DocumentScript.java:87-108:

private void removeAnyLinguisticsSpanTree(FieldValue value) {
    if (value instanceof StringFieldValue) {
        ((StringFieldValue)value).removeSpanTree(SpanTrees.LINGUISTICS);
    } else if (value instanceof Array<?> arr) {
        for (FieldValue fv : arr.getValues()) {
            removeAnyLinguisticsSpanTree(fv);  // Recursive
        }
    }
    // ... similar recursion for WeightedSet, Map, Struct
}
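
The elided branches follow the same shape as the Array case: descend into each contained value and strip the tree wherever a string is found. The sketch below shows that traversal on stand-in types, since the exact Vespa accessors for WeightedSet, Map, and Struct are not shown in the excerpt. Everything here is hypothetical: Text models a string value carrying named span trees, a java.util.List models an array, and a java.util.Map models a struct or map whose values are visited.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for Vespa field values; only the traversal shape matters.
final class SpanTreeSketch {

    /** Models a string value that may carry several named span trees. */
    static final class Text {
        final Set<String> spanTrees = new HashSet<>();

        Text(String... treeNames) {
            spanTrees.addAll(Arrays.asList(treeNames));
        }
    }

    /** Removes the span tree named treeName from every Text reachable from value. */
    static void strip(Object value, String treeName) {
        if (value instanceof Text) {
            ((Text) value).spanTrees.remove(treeName);
        } else if (value instanceof List<?>) {
            for (Object element : (List<?>) value) {
                strip(element, treeName); // recurse into array elements
            }
        } else if (value instanceof Map<?, ?>) {
            for (Object field : ((Map<?, ?>) value).values()) {
                strip(field, treeName); // recurse into struct or map values
            }
        }
        // Any other value type carries no span trees and is left alone.
    }
}
```

Calling strip on a list of maps that each contain a Text reaches every string, which is the array-of-structs case described in the reasoning above.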
