Heuristic: OpenHands Clustered Race Condition Prevention
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Conversation_Management |
| Last Updated | 2026-02-11 21:00 GMT |
Overview
Increment `max_concurrent_conversations` by 1 in clustered mode to prevent race conditions when multiple servers simultaneously start conversations.
Description
In the OpenHands clustered conversation manager, the system increments the maximum concurrent conversation limit by 1 during initialization. This counterintuitive adjustment compensates for a race condition inherent in the distributed conversation startup flow: a server first registers the conversation in Redis (consuming a slot), then counts the total running conversations to check if the limit is exceeded. Without the +1 adjustment, the newly registered conversation would count against itself, causing the limit check to incorrectly reject valid conversation starts. This pattern is a form of optimistic concurrency control that works with Redis atomic operations.
Usage
Apply this pattern specifically in the ClusteredConversationManager initialization. This is relevant whenever you need to implement a register-then-check pattern in a distributed system where the registration itself must be counted.
The Insight (Rule of Thumb)
- Action: Add 1 to `max_concurrent_conversations` in `__post_init__` of `ClusteredConversationManager`.
- Value: Exactly +1 to the configured maximum.
- Trade-off: Allows one extra conversation above the intended limit in a theoretical worst case. In practice, the atomic Redis operations make this extremely unlikely, and the +1 ensures legitimate conversation starts are never rejected by their own registration.
Reasoning
The race condition occurs because the clustered flow is:
- Server A calls `redis.set(conversation_key, 1, nx=True)` to claim the conversation (registration).
- Server A counts all active conversation keys to check if the limit is exceeded.
- The newly registered key is included in the count, making it appear that the limit is already reached.
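The off-by-one can be reproduced in a minimal simulation. Here a plain dict stands in for Redis, `register` mimics `redis.set(key, 1, nx=True)`, and the helper names and the `>=` capacity check are illustrative assumptions, not the actual OpenHands code:

```python
# A dict stands in for Redis in this sketch.
store: dict[str, int] = {}

def register(key: str) -> bool:
    """Claim the conversation slot, like `redis.set(key, 1, nx=True)`."""
    if key in store:
        return False
    store[key] = 1
    return True

def try_start(key: str, limit: int) -> bool:
    """Register first, then check capacity (the clustered flow)."""
    if not register(key):
        return False
    if len(store) >= limit:   # the just-registered key is in this count
        del store[key]        # roll back: limit appears reached
        return False
    return True

# Configured maximum: 2 concurrent conversations.
store.clear()
naive = [try_start(f"conv-{i}", limit=2) for i in range(3)]
print(naive)       # [True, False, False] -- second start rejected by its own key

store.clear()
adjusted = [try_start(f"conv-{i}", limit=2 + 1) for i in range(3)]
print(adjusted)    # [True, True, False] -- limit of 2 enforced correctly
```

With the raw limit, the second legitimate start is rejected because its own registration pushes the count to the limit; with the +1 adjustment, exactly the configured number of conversations is admitted.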
By incrementing the limit by 1, the server compensates for its own registration being included in the count. This is simpler and more reliable than the alternative approaches:
- Alternative 1: Count first, then register. This has a TOCTOU (time-of-check to time-of-use) race: another server could register between the count and the registration.
- Alternative 2: Exclude own key from count. This requires knowing the key name at count time and adds complexity.
- Alternative 3: Use a Lua script for atomic check-and-set. This adds complexity and couples the logic to Redis scripting.
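The TOCTOU flaw in Alternative 1 is easy to demonstrate by interleaving two servers' steps by hand. A shared dict again stands in for Redis; the explicit interleaving and helper names are illustrative assumptions:

```python
# Alternative 1 (count first, then register) under an adversarial schedule.
store: dict[str, int] = {}
LIMIT = 1  # only one concurrent conversation allowed

def under_limit(limit: int) -> bool:
    return len(store) < limit       # time of check

def register(key: str) -> None:
    store[key] = 1                  # time of use

# Both servers pass the check before either registers:
a_ok = under_limit(LIMIT)   # Server A sees 0 < 1 -> True
b_ok = under_limit(LIMIT)   # Server B also sees 0 < 1 -> True
if a_ok:
    register("conv-a")
if b_ok:
    register("conv-b")

print(len(store))  # 2 -- the limit of 1 was silently exceeded
```

The register-then-check flow avoids this window entirely: the atomic `SET ... NX` claims the slot before any count is taken, at the cost of the +1 adjustment described above.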
Code evidence from `enterprise/server/clustered_conversation_manager.py:83-88`:
```python
def __post_init__(self):
    # We increment the max_concurrent_conversations by 1 because this class
    # marks the conversation as started in Redis before checking the number
    # of running conversations. This prevents race conditions where multiple
    # servers might simultaneously start new conversations.
    self.config.max_concurrent_conversations += 1
```
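A self-contained sketch of the same adjustment, with dataclass stand-ins (the class and field names below mirror the snippet but are not the real OpenHands definitions):

```python
from dataclasses import dataclass, field

@dataclass
class ServerConfig:
    """Stand-in for the real OpenHands config object."""
    max_concurrent_conversations: int = 3

@dataclass
class ClusteredConversationManager:
    config: ServerConfig = field(default_factory=ServerConfig)

    def __post_init__(self):
        # Compensate for the register-then-check flow: the conversation
        # being started is already registered when the count is taken,
        # so allow the count to reach configured_max + 1.
        self.config.max_concurrent_conversations += 1

mgr = ClusteredConversationManager(ServerConfig(max_concurrent_conversations=3))
print(mgr.config.max_concurrent_conversations)  # 4
```

Note that the adjustment mutates the config in place, so it must run exactly once per manager instance; `__post_init__` guarantees that for dataclass construction.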