

Principle:Vespa engine Vespa FRT Transport and Failover

From Leeroopedia


Metadata
Sources Vespa
Domains Configuration, Distributed_Systems
Last Updated 2026-02-09 12:00 GMT

Overview

FRT (Fast Reliable Transport) is Vespa's custom RPC protocol for configuration distribution. The connection pool manages connections to one or more config servers, selecting among them either round-robin or by a hash of the local hostname, and fails over automatically when a server is unavailable. The config agent processes responses and updates local config holders.

Description

Vespa's configuration system uses a purpose-built RPC protocol called FRT (Fast Reliable Transport), built on top of the FNET transport layer, to distribute configuration from config servers to subscribing processes. The FRT subsystem consists of two key components: the connection pool for managing server connectivity, and the config agent for processing configuration responses.

Connection Pool and Failover

The FRTConnectionPool manages connections to one or more config servers (or config proxies). It provides automatic failover when a config server becomes unavailable by maintaining a pool of connections and selecting among them using one of two strategies:

Round-Robin Selection: When the local hostname is not set, connections are selected in a round-robin fashion from the pool of ready (non-suspended) sources. If all sources are suspended (experiencing errors), the round-robin cycles through the suspended sources instead, so requests are still attempted. This spreads load across config servers.
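The round-robin rule can be sketched in C++ as follows. This is a minimal illustration, assuming a hypothetical `Source` struct with a `suspended` flag and a hypothetical `RoundRobinSelector` class; the actual FRTConnectionPool uses its own types and bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one config server connection.
struct Source {
    std::string address;
    bool suspended = false;  // true while the source is backing off after errors
};

// Hypothetical selector: round-robin over ready sources, falling back to
// round-robin over all sources when every one of them is suspended.
class RoundRobinSelector {
public:
    explicit RoundRobinSelector(std::vector<Source>& sources) : sources_(sources) {}

    // Returns the index of the next source to use, or -1 if the pool is empty.
    int next() {
        std::vector<int> ready;
        for (std::size_t i = 0; i < sources_.size(); ++i) {
            if (!sources_[i].suspended) ready.push_back(static_cast<int>(i));
        }
        if (!ready.empty()) {
            return ready[counter_++ % ready.size()];
        }
        if (sources_.empty()) return -1;
        // All sources suspended: keep cycling through them so requests still flow.
        return static_cast<int>(counter_++ % sources_.size());
    }

private:
    std::vector<Source>& sources_;
    std::size_t counter_ = 0;  // shared cursor for both modes
};
```

Rebuilding the ready list on each call keeps the sketch simple; a production pool would more likely maintain the ready and suspended lists incrementally.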

Hash-Based Selection: When the local hostname is set, a deterministic hash of the hostname selects the preferred config server. This ensures that a given process consistently connects to the same config server, providing locality. The hash function is designed to be compatible with the Java implementation (using Java's String.hashCode() algorithm) for cross-language consistency.
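A Java-compatible `String.hashCode()` (h = 31·h + c, wrapping on 32-bit overflow) and an index selection on top of it might look like the C++ sketch below. It assumes ASCII hostnames, where each byte equals the Java UTF-16 code unit. The function names are hypothetical, and taking the absolute value after the modulo is this sketch's choice, not something the page specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// Java-compatible String.hashCode() for ASCII input.
// Unsigned arithmetic makes overflow wrap exactly like Java's int.
int32_t javaStringHashCode(const std::string& s) {
    uint32_t h = 0;
    for (unsigned char c : s) {
        h = 31u * h + c;
    }
    return static_cast<int32_t>(h);  // two's-complement reinterpretation
}

// Hypothetical selection rule: map a hostname to an index in [0, n).
// Precondition: n > 0. Applying abs() after the modulo avoids abs(INT32_MIN).
std::size_t selectByHostname(const std::string& hostname, std::size_t n) {
    int32_t h = javaStringHashCode(hostname);
    int32_t idx = std::abs(h % static_cast<int32_t>(n));
    return static_cast<std::size_t>(idx);
}
```

For example, `javaStringHashCode("hello")` yields 99162322, the same value Java's `"hello".hashCode()` returns, so a C++ and a Java client with the same hostname compute the same index.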

Each connection tracks its error state and can be suspended for a configurable duration after errors. The connection pool separates connections into ready and suspended lists, preferring ready connections but falling back to suspended ones when no healthy connections are available.
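The ready/suspended split can be sketched in C++ as below, assuming a hypothetical `Connection` record that stores a suspension deadline, and hypothetical helpers `splitByHealth` and `candidates`. The current time is passed in explicitly so that the behavior is deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical connection record: an error suspends it until a deadline.
struct Connection {
    std::string address;
    Clock::time_point suspendedUntil{};  // clock epoch means "never suspended"

    void markError(Clock::time_point now, Clock::duration suspendFor) {
        suspendedUntil = now + suspendFor;
    }
    void markSuccess() { suspendedUntil = Clock::time_point{}; }
    bool isSuspended(Clock::time_point now) const { return now < suspendedUntil; }
};

using Split = std::pair<std::vector<Connection*>, std::vector<Connection*>>;

// Split the pool into (ready, suspended) lists as of time `now`.
Split splitByHealth(std::vector<Connection>& pool, Clock::time_point now) {
    std::vector<Connection*> ready, suspended;
    for (auto& c : pool) {
        (c.isSuspended(now) ? suspended : ready).push_back(&c);
    }
    return {ready, suspended};
}

// Prefer ready connections; fall back to suspended ones if none are healthy.
const std::vector<Connection*>& candidates(const Split& split) {
    return split.first.empty() ? split.second : split.first;
}
```

A suspended connection becomes ready again simply because time passes its deadline, which matches the idea of suspension "for a configurable duration".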

Config Agent and Response Processing

The FRTConfigAgent processes configuration responses received from the config server. It implements a state machine for tracking the configuration state:

OK Responses: When a valid response arrives, the agent compares the new config state (generation + content hash) against the current state. If the state has changed, it creates a ConfigUpdate and pushes it to the config holder for the subscription to consume. The wait time and timeout are reset to their success values.
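The OK path can be sketched in C++ as follows. It assumes hypothetical `ConfigState`, `ConfigUpdate`, `Timing`, and `OkHandlerSketch` types; the real agent's types, field names, and timing defaults are not given on this page.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>

using std::chrono::milliseconds;

// Hypothetical config state: generation plus a hash of the payload.
struct ConfigState {
    int64_t generation = 0;
    uint64_t contentHash = 0;
    bool operator==(const ConfigState& o) const {
        return generation == o.generation && contentHash == o.contentHash;
    }
};

// Hypothetical update handed to the config holder.
struct ConfigUpdate {
    ConfigState state;
    bool changed;  // true only when the content hash differs
};

// Hypothetical success timing values; the real defaults are not stated here.
struct Timing {
    milliseconds successDelay{0};
    milliseconds successTimeout{0};
};

class OkHandlerSketch {
public:
    explicit OkHandlerSketch(Timing t) : timing_(t) {}

    // Returns an update if the (generation, hash) state moved, else nothing.
    std::optional<ConfigUpdate> handleOk(const ConfigState& incoming) {
        waitTime_ = timing_.successDelay;       // reset to success values
        nextTimeout_ = timing_.successTimeout;
        if (incoming == current_) return std::nullopt;
        bool changed = incoming.contentHash != current_.contentHash;
        current_ = incoming;
        return ConfigUpdate{incoming, changed};
    }

    milliseconds waitTime() const { return waitTime_; }
    milliseconds nextTimeout() const { return nextTimeout_; }

private:
    Timing timing_;
    ConfigState current_{};
    milliseconds waitTime_{0};
    milliseconds nextTimeout_{0};
};
```

In this sketch, a response that only bumps the generation still produces an update, but with `changed == false`.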

Error Responses: When an error or timeout occurs, the agent increments a failure counter and applies a capped backoff: the delay before the next retry grows linearly with the number of consecutive failures until the multiplier reaches maxDelayMultiplier, after which it stays constant. The timeout is set to the error timeout value. The agent also distinguishes errors that occur before the process has received its first config (unconfigured) from errors after (configured), using a different base delay for each case.
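The error path can be sketched in C++ using the waitTime formula given under Theoretical Basis. The `ErrorTiming` fields and the `ErrorHandlerSketch` class are hypothetical names chosen to mirror the prose, and the values in the usage example are illustrative, not Vespa defaults.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical timing knobs; names mirror the prose, values come from the caller.
struct ErrorTiming {
    milliseconds fixedDelay;
    milliseconds unconfiguredDelay;     // base delay before the first config arrives
    milliseconds configuredErrorDelay;  // base delay once the process is configured
    milliseconds errorTimeout;
    int maxDelayMultiplier;
};

class ErrorHandlerSketch {
public:
    explicit ErrorHandlerSketch(ErrorTiming t) : t_(t) {}

    void markConfigured() { configured_ = true; }

    void handleError() {
        ++failedRequests_;
        // The multiplier grows by one per failure and is capped.
        int multiplier = std::min(failedRequests_, t_.maxDelayMultiplier);
        milliseconds delay = configured_ ? t_.configuredErrorDelay : t_.unconfiguredDelay;
        waitTime_ = t_.fixedDelay + multiplier * delay;
        nextTimeout_ = t_.errorTimeout;
    }

    milliseconds waitTime() const { return waitTime_; }
    milliseconds nextTimeout() const { return nextTimeout_; }

private:
    ErrorTiming t_;
    bool configured_ = false;
    int failedRequests_ = 0;
    milliseconds waitTime_{0};
    milliseconds nextTimeout_{0};
};
```

With fixedDelay = 50 ms, an unconfigured delay of 100 ms, and maxDelayMultiplier = 3, successive errors produce waits of 150, 250, 350, 350, ... ms.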

Content Hashing: The agent uses xxhash64 to detect cheaply whether config content has changed between generations. Every new generation is propagated, but the changed flag on an update is set only when the content itself differs.
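The changed-flag rule can be sketched in C++ as below. xxhash64 is not part of the C++ standard library, so this sketch swaps in `std::hash<std::string>` to avoid an external dependency. Unlike xxhash64, `std::hash` is only required to be consistent within a single program execution. `UpdateInfo` and `ChangeTracker` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Stand-in for xxhash64 (NOT the real algorithm): std::hash is used only to
// keep the sketch self-contained.
uint64_t contentHash(const std::string& payload) {
    return static_cast<uint64_t>(std::hash<std::string>{}(payload));
}

// Hypothetical update summary.
struct UpdateInfo {
    int64_t generation;
    bool changed;
};

// Every generation yields an update; `changed` is raised only when the
// payload hash differs from the previous one.
class ChangeTracker {
public:
    UpdateInfo onNewGeneration(int64_t generation, const std::string& payload) {
        uint64_t h = contentHash(payload);
        bool changed = !hasHash_ || h != lastHash_;
        lastHash_ = h;
        hasHash_ = true;
        return UpdateInfo{generation, changed};
    }

private:
    uint64_t lastHash_ = 0;
    bool hasHash_ = false;
};
```

Comparing hashes instead of full payloads keeps the check cheap. The trade-off is that a 64-bit collision would hide a real change, which is extremely unlikely in practice.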

Usage

This principle applies whenever Vespa processes communicate with config servers to fetch configuration. It is relevant to understanding:

  • How config subscriptions are fulfilled at the network level
  • How the system handles config server failures and network partitions
  • How config responses are processed and made available to subscriptions
  • The timing and retry semantics of config distribution

Theoretical Basis

The FRT transport and failover system implements several well-known distributed systems patterns:

Connection Pooling with Health Tracking: The connection pool implements a variant of the circuit breaker pattern. When a connection encounters errors, it is suspended (circuit opens). After a delay, it becomes available again (circuit half-opens). A successful request returns it to normal operation (circuit closes).
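The circuit-breaker mapping described above can be sketched as a small C++ state machine. `CircuitState` and `ConnectionBreaker` are hypothetical names, and the suspension delay is supplied by the caller.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

enum class CircuitState { Closed, Open, HalfOpen };

// Hypothetical per-connection breaker:
// error -> Open (suspended); delay elapsed -> HalfOpen; success -> Closed.
class ConnectionBreaker {
public:
    explicit ConnectionBreaker(Clock::duration suspendFor) : suspendFor_(suspendFor) {}

    // Reports the state at `now`, promoting Open to HalfOpen once the delay passes.
    CircuitState state(Clock::time_point now) {
        if (state_ == CircuitState::Open && now >= reopenAt_) {
            state_ = CircuitState::HalfOpen;  // eligible for a trial request
        }
        return state_;
    }

    void onError(Clock::time_point now) {
        state_ = CircuitState::Open;
        reopenAt_ = now + suspendFor_;
    }

    void onSuccess() { state_ = CircuitState::Closed; }

private:
    Clock::duration suspendFor_;
    CircuitState state_ = CircuitState::Closed;
    Clock::time_point reopenAt_{};
};
```

A failed trial request in the half-open state simply reopens the circuit for another full suspension period.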

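Bounded Backoff: Error handling uses a capped backoff with the formula:

waitTime = fixedDelay + (multiplier * errorDelay)

where multiplier = min(failedRequests, maxDelayMultiplier) and errorDelay is the unconfigured or configured error delay. Because the multiplier grows by one per consecutive failure, the wait increases linearly rather than exponentially, and it never exceeds fixedDelay + maxDelayMultiplier * errorDelay. This prevents overwhelming a struggling config server while ensuring bounded recovery time.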
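Written out symbolically, the formula above has the following shape. Here n is the number of consecutive failed requests, d_f is fixedDelay, d_e is the applicable error delay, and M is maxDelayMultiplier. These symbols are introduced only for this derivation.

```latex
\begin{aligned}
W(n) &= d_f + \min(n, M)\, d_e, \qquad n \ge 1 \\
W(n) - W(n-1) &= \begin{cases} d_e, & 2 \le n \le M \\ 0, & n > M \end{cases} \\
W(n) &\le d_f + M\, d_e
\end{aligned}
```

Each additional failure adds a constant step d_e until the cap is reached. An exponential schedule would instead multiply the delay at each step.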

Long Polling: Config requests use a long poll pattern where the server holds the request until either new config is available or the timeout expires. This provides near-real-time notification while minimizing network traffic. The timeout values differ based on context:

  • successTimeout: Used after a successful request (longer, since the server is expected to hold the request until config changes)
  • errorTimeout: Used after an error (shorter, to recover quickly)
  • initialTimeout: Used for the first request (moderate)
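The long-poll hold and the timeout choice listed above can be sketched in C++. A hypothetical server-side `LongPollHolder` blocks a request until a generation newer than the client's is published or the timeout expires. A hypothetical `chooseTimeout` then picks among initialTimeout, successTimeout, and errorTimeout based on the previous outcome. The types and the timeout values in the usage example are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>

// Hypothetical server-side holder for one config key.
class LongPollHolder {
public:
    void publish(int64_t generation) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            generation_ = generation;
        }
        cv_.notify_all();  // wake every request that is being held
    }

    // Blocks until a generation newer than `known` exists or `timeout` expires.
    // Returns the newer generation, or std::nullopt on timeout.
    std::optional<int64_t> waitForNewer(int64_t known, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mu_);
        bool newer = cv_.wait_for(lock, timeout, [&] { return generation_ > known; });
        if (!newer) return std::nullopt;
        return generation_;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int64_t generation_ = 0;
};

// Client-side timeout selection based on how the previous request ended.
enum class LastOutcome { None, Success, Error };

struct PollTimeouts {
    std::chrono::milliseconds initialTimeout;
    std::chrono::milliseconds successTimeout;
    std::chrono::milliseconds errorTimeout;
};

std::chrono::milliseconds chooseTimeout(LastOutcome last, const PollTimeouts& t) {
    switch (last) {
        case LastOutcome::Success: return t.successTimeout;
        case LastOutcome::Error:   return t.errorTimeout;
        case LastOutcome::None:    break;
    }
    return t.initialTimeout;
}
```

The `wait_for` overload that takes a predicate handles spurious wakeups, so a held request returns only when config has actually advanced or time has run out.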

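Hash-Based Affinity for Locality: The hash-based connection selection uses a deterministic hash of the hostname to provide client affinity: the same client connects to the same server as long as the set of available servers does not change. This is affinity through a deterministic hash rather than consistent hashing in the hash-ring sense, so a change in the number of servers can move many clients at once. Affinity lets the server optimize by caching per-client state.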
