

Principle:OpenHands Distributed State Management

From Leeroopedia
Knowledge Sources
Domains Distributed_Systems, Conversation_Management
Last Updated 2026-02-11 21:00 GMT

Overview

Distributed state management is the coordination of conversation state across multiple server instances in a cluster using a publish-subscribe messaging pattern backed by a shared data store.

Description

In a horizontally-scaled deployment, multiple server nodes may need to be aware of the same conversation's state. For example, when a user's WebSocket connection is routed to Node A but the agent loop for that conversation is running on Node B, Node A must receive state updates from Node B to relay them to the user's browser. Without a coordination mechanism, each node would only know about conversations it directly owns, creating an inconsistent view of the system.

Distributed state management solves this through:

  • Pub/sub messaging -- Nodes publish state change events (conversation started, status changed, conversation stopped) to a shared message bus. All nodes subscribe to these events and update their local state accordingly.
  • Shared state store -- A centralized data store (typically Redis) holds the canonical state of all conversations. Nodes read from this store to reconcile their local state on startup or after missed messages.
  • Local state cache -- Each node maintains a local in-memory cache of conversation states for fast lookups, updated by incoming pub/sub messages and periodic reconciliation with the shared store.

The combination of pub/sub for real-time updates and a shared store for durability ensures that all nodes converge on the same state view, even after transient network partitions or node restarts.
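The three mechanisms can be modeled in a single process to show the convergence property. The sketch below is illustrative only, not the OpenHands implementation: a plain dict stands in for the Redis shared store, a `Bus` class stands in for Redis pub/sub, and the `Node` class names are hypothetical.

```python
class Bus:
    """In-process stand-in for a pub/sub message bus (e.g. Redis pub/sub)."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        # Deliver the message to every subscriber, including the publisher.
        for handler in self.subscribers:
            handler(message)


class Node:
    """One server node: a local cache kept in sync via bus messages."""
    def __init__(self, node_id, bus, shared_store):
        self.node_id = node_id
        self.local_cache = {}          # conversation_id -> status
        self.shared_store = shared_store
        self.bus = bus
        bus.subscribe(self.on_message)

    def update_status(self, conversation_id, new_status):
        # Write the canonical state first, then notify peers.
        self.local_cache[conversation_id] = new_status
        self.shared_store[conversation_id] = new_status
        self.bus.publish({
            "type": "status_changed",
            "conversation_id": conversation_id,
            "new_status": new_status,
            "source_node": self.node_id,
        })

    def on_message(self, message):
        if message["source_node"] == self.node_id:
            return                     # self-message filtering
        if message["type"] == "status_changed":
            self.local_cache[message["conversation_id"]] = message["new_status"]


bus, store = Bus(), {}
node_a, node_b = Node("a", bus, store), Node("b", bus, store)
node_a.update_status("conv-1", "RUNNING")
# Both nodes now agree: node_b learned the status via pub/sub.
assert node_a.local_cache == node_b.local_cache == {"conv-1": "RUNNING"}
```

In a real deployment the bus delivery is asynchronous and best-effort, which is why the shared store, not the bus, holds the canonical state.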

Usage

Use distributed state management whenever:

  • The system runs on more than one server node and conversations may be accessed from any node.
  • WebSocket connections and agent loops may reside on different nodes.
  • The system must survive node failures and redistribute conversation ownership.

Theoretical Basis

The pattern combines publish-subscribe messaging with an eventually consistent replication model from distributed systems theory: pub/sub delivers low-latency, best-effort updates, while the shared store is the authority toward which every node's local cache converges.

Pseudocode:

# Subscriber side (runs on every node)
function redis_subscribe():
    channel = "conversation_state_updates"
    subscription = redis.subscribe(channel)

    async for message in subscription:
        process_message(message)


function process_message(raw_message):
    message = deserialize(raw_message)
    event_type = message["type"]
    conversation_id = message["conversation_id"]
    source_node = message["source_node"]

    if source_node == self_node_id:
        # Ignore messages from ourselves
        return

    if event_type == "status_changed":
        update_local_state(conversation_id, message["new_status"])
        notify_connected_clients(conversation_id, message["new_status"])

    elif event_type == "conversation_stopped":
        remove_from_local_state(conversation_id)
        disconnect_clients(conversation_id)

    elif event_type == "conversation_started":
        add_to_local_state(conversation_id, message["initial_status"])


# Publisher side (called when local state changes)
function update_state_in_redis(conversation_id, new_status):
    # Update the shared store
    redis.hset("conversation_states", conversation_id, serialize(new_status))

    # Publish a notification so other nodes pick up the change
    redis.publish("conversation_state_updates", serialize({
        "type": "status_changed",
        "conversation_id": conversation_id,
        "new_status": new_status,
        "source_node": self_node_id,
    }))
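Publisher-side ordering matters: the shared store is written before the notification goes out, so a subscriber that reacts to the message and reads the store is guaranteed to see the new value. A small sketch with a fake client (hypothetical names, not the real Redis client API) makes that ordering checkable:

```python
import json

class FakeRedis:
    """Records hset/publish calls in order; a stand-in for a real Redis client."""
    def __init__(self):
        self.calls = []
        self.hashes = {}

    def hset(self, key, field, value):
        self.hashes.setdefault(key, {})[field] = value
        self.calls.append(("hset", key, field))

    def publish(self, channel, payload):
        self.calls.append(("publish", channel, payload))


SELF_NODE_ID = "node-a"  # assumed identifier for this node

def update_state_in_redis(redis, conversation_id, new_status):
    # 1) Durable write to the canonical store first...
    redis.hset("conversation_states", conversation_id, json.dumps(new_status))
    # 2) ...then the best-effort pub/sub notification.
    redis.publish("conversation_state_updates", json.dumps({
        "type": "status_changed",
        "conversation_id": conversation_id,
        "new_status": new_status,
        "source_node": SELF_NODE_ID,
    }))


r = FakeRedis()
update_state_in_redis(r, "conv-1", "RUNNING")
assert [c[0] for c in r.calls] == ["hset", "publish"]  # store write precedes publish
```

If the order were reversed, a fast subscriber could read the store before the write landed and cache a stale status.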

Key invariants:

  • Self-message filtering -- Nodes must ignore pub/sub messages they published themselves to avoid feedback loops.
  • Idempotent processing -- Message handlers must be idempotent because pub/sub does not guarantee exactly-once delivery.
  • Reconciliation -- Periodic reconciliation with the shared store catches any messages lost during transient disconnections.
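The last two invariants can be exercised directly. In this sketch (illustrative names only), the handler is idempotent, so a duplicate delivery is harmless, and a reconciliation pass repairs a cache that missed a message by re-reading the shared store:

```python
shared_store = {"conv-1": "RUNNING", "conv-2": "STOPPED"}  # canonical state
local_cache = {"conv-1": "STARTING"}                        # stale: missed updates

def handle_status_changed(cache, message):
    # Idempotent: applying the same message twice yields the same state.
    cache[message["conversation_id"]] = message["new_status"]

def reconcile(cache, store):
    # Replace the local view with the canonical one; catches lost messages.
    cache.clear()
    cache.update(store)

msg = {"conversation_id": "conv-1", "new_status": "RUNNING"}
handle_status_changed(local_cache, msg)
handle_status_changed(local_cache, msg)   # duplicate delivery: no harm
assert local_cache["conv-1"] == "RUNNING"

reconcile(local_cache, shared_store)      # picks up conv-2, which was missed
assert local_cache == shared_store
```

Because the handler assigns the status rather than incrementing or appending, at-least-once delivery from the bus is safe by construction.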

Related Pages

Implemented By

Heuristics
