# Principle: OpenHands Distributed State Management
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Conversation_Management |
| Last Updated | 2026-02-11 21:00 GMT |
## Overview
Distributed state management is the coordination of conversation state across multiple server instances in a cluster using a publish-subscribe messaging pattern backed by a shared data store.
## Description
In a horizontally-scaled deployment, multiple server nodes may need to be aware of the same conversation's state. For example, when a user's WebSocket connection is routed to Node A but the agent loop for that conversation is running on Node B, Node A must receive state updates from Node B to relay them to the user's browser. Without a coordination mechanism, each node would only know about conversations it directly owns, creating an inconsistent view of the system.
Distributed state management solves this through:
- Pub/sub messaging -- Nodes publish state change events (conversation started, status changed, conversation stopped) to a shared message bus. All nodes subscribe to these events and update their local state accordingly.
- Shared state store -- A centralized data store (typically Redis) holds the canonical state of all conversations. Nodes read from this store to reconcile their local state on startup or after missed messages.
- Local state cache -- Each node maintains a local in-memory cache of conversation states for fast lookups, updated by incoming pub/sub messages and periodic reconciliation with the shared store.
The combination of pub/sub for real-time updates and a shared store for durability ensures that all nodes converge on the same state view, even after transient network partitions or node restarts.
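The cache-plus-store split can be sketched as follows. This is a minimal illustration, not the actual implementation: all names are hypothetical, and a plain dict stands in for the Redis hash that a real deployment would read with `HGETALL`.

```python
import json

# In-memory stand-in for the shared Redis hash "conversation_states".
# A real deployment would call redis.hset / redis.hgetall instead.
shared_store = {}

class NodeStateCache:
    """Local in-memory view of conversation states on one node (hypothetical)."""

    def __init__(self):
        self.states = {}

    def apply_update(self, conversation_id, new_status):
        # Fast path: a pub/sub message updates the cache directly.
        self.states[conversation_id] = new_status

    def reconcile(self, store):
        # Slow path: replace the local view with the canonical store,
        # catching any updates missed while disconnected.
        self.states = {cid: json.loads(raw) for cid, raw in store.items()}

# Node B changes state and writes it to the shared store.
shared_store["conv-1"] = json.dumps("running")
shared_store["conv-2"] = json.dumps("stopped")

# Node A starts with an empty cache and reconciles against the store.
cache = NodeStateCache()
cache.reconcile(shared_store)
print(cache.states["conv-1"])  # running
```

On startup a node has no pub/sub history, so reconciliation against the shared store is what bootstraps its local view; afterwards pub/sub keeps that view current.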
## Usage
Use distributed state management whenever:
- The system runs on more than one server node and conversations may be accessed from any node.
- WebSocket connections and agent loops may reside on different nodes.
- The system must survive node failures and redistribute conversation ownership.
## Theoretical Basis
The pattern combines publish-subscribe messaging with eventual consistency from distributed systems theory.
Pseudocode:

```
# Subscriber side (runs on every node)
function redis_subscribe():
    channel = "conversation_state_updates"
    subscription = redis.subscribe(channel)
    async for message in subscription:
        process_message(message)

function process_message(message):
    event_type = message["type"]
    conversation_id = message["conversation_id"]
    source_node = message["source_node"]
    if source_node == self_node_id:
        # Ignore messages from ourselves
        return
    if event_type == "status_changed":
        update_local_state(conversation_id, message["new_status"])
        notify_connected_clients(conversation_id, message["new_status"])
    elif event_type == "conversation_stopped":
        remove_from_local_state(conversation_id)
        disconnect_clients(conversation_id)
    elif event_type == "conversation_started":
        add_to_local_state(conversation_id, message["initial_status"])

# Publisher side (called when local state changes)
function update_state_in_redis(conversation_id, new_status):
    # Update the shared store
    redis.hset("conversation_states", conversation_id, serialize(new_status))
    # Publish notification
    redis.publish("conversation_state_updates", {
        "type": "status_changed",
        "conversation_id": conversation_id,
        "new_status": new_status,
        "source_node": self_node_id,
    })
```
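The publish/subscribe flow above can be exercised end to end with an in-memory bus standing in for Redis pub/sub. This is a self-contained sketch under that assumption: the class and method names are illustrative, and a real deployment would use a Redis client rather than direct callbacks.

```python
class InMemoryBus:
    """Toy pub/sub bus that delivers each message to every subscriber."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        for handler in self.subscribers:
            handler(message)

class Node:
    """One server node with a local state cache (hypothetical names)."""

    def __init__(self, node_id, bus):
        self.node_id = node_id
        self.local_state = {}
        self.bus = bus
        bus.subscribe(self.process_message)

    def process_message(self, message):
        if message["source_node"] == self.node_id:
            return  # self-message filtering: avoid feedback loops
        if message["type"] == "status_changed":
            self.local_state[message["conversation_id"]] = message["new_status"]
        elif message["type"] == "conversation_stopped":
            self.local_state.pop(message["conversation_id"], None)

    def set_status(self, conversation_id, new_status):
        # Update local state, then notify every other node.
        self.local_state[conversation_id] = new_status
        self.bus.publish({
            "type": "status_changed",
            "conversation_id": conversation_id,
            "new_status": new_status,
            "source_node": self.node_id,
        })

bus = InMemoryBus()
node_a, node_b = Node("a", bus), Node("b", bus)
node_b.set_status("conv-1", "running")
print(node_a.local_state)  # {'conv-1': 'running'}
```

After `node_b.set_status(...)`, both nodes hold the same view: node B updated itself directly, node A learned of the change via the bus, and node B ignored its own published message.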
Key invariants:
- Self-message filtering -- Nodes must ignore pub/sub messages they published themselves to avoid feedback loops.
- Idempotent processing -- Message handlers must be idempotent because pub/sub does not guarantee exactly-once delivery.
- Reconciliation -- Periodic reconciliation with the shared store catches any messages lost during transient disconnections.
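The idempotency invariant can be checked by replaying a message against the handler. The snippet below is a minimal sketch with hypothetical names; it works because setting a key to a value is naturally idempotent, so redelivered messages leave the state unchanged.

```python
def handle_status_changed(local_state, message):
    # Assigning the status is idempotent: applying the same message
    # twice produces the same state as applying it once.
    local_state[message["conversation_id"]] = message["new_status"]
    return local_state

msg = {"conversation_id": "conv-1", "new_status": "running"}
state = handle_status_changed({}, msg)
state_after_replay = handle_status_changed(dict(state), msg)
print(state == state_after_replay)  # True
```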