Heuristic:OpenHands OpenHands Fail Open Rate Limiting
| Knowledge Sources | |
|---|---|
| Domains | SaaS_Infrastructure, Distributed_Systems |
| Last Updated | 2026-02-11 21:00 GMT |
Overview
Rate limiting implementation that fails open (allows requests) when Redis is unavailable, prioritizing availability for legitimate users over strict abuse prevention.
Description
OpenHands implements a fail-open rate limiting strategy using Redis as the backing store. When Redis is unavailable or an error occurs during the rate limit check, the system allows the request to proceed rather than blocking it. This is a deliberate architectural decision: blocking legitimate users due to an infrastructure failure (Redis outage) is considered worse than temporarily allowing potential abuse. The rate limiter uses Redis atomic set-with-expiry (`SET key value NX EX ttl`) for both checking and setting rate limits in a single operation, providing race-condition-free enforcement across distributed server instances.
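The atomic check-and-set described here can be sketched as follows. This is a minimal illustration, not the OpenHands implementation: `InMemoryRedis` is a hypothetical stand-in that mimics only the `SET ... NX EX` behavior (returning `True` when the key is created and `None` when it already exists, as redis-py does), and `try_acquire` and the key format are assumed names.

```python
import asyncio
import time


class InMemoryRedis:
    """Hypothetical stand-in that mimics only SET with NX and EX."""

    def __init__(self):
        self._expires_at = {}  # key -> monotonic expiry time

    async def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        expiry = self._expires_at.get(key)
        if nx and expiry is not None and expiry > now:
            return None  # key is still live: NX refuses to overwrite it
        self._expires_at[key] = (now + ex) if ex is not None else float("inf")
        return True


async def try_acquire(redis, key, window_seconds):
    """Return True if the request is allowed, False if it is rate limited."""
    # One round trip both checks for the key and creates it with a TTL.
    created = await redis.set(key, 1, nx=True, ex=window_seconds)
    return bool(created)


async def demo():
    redis = InMemoryRedis()
    first = await try_acquire(redis, "rate_limit:user:42", 120)
    second = await try_acquire(redis, "rate_limit:user:42", 120)
    return first, second


print(asyncio.run(demo()))  # (True, False)
```

Because the check and the write are one command, two server instances racing on the same key cannot both see "not present" and both proceed, which is the failure mode of a separate `GET` followed by `SET`.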
Usage
Apply this pattern when implementing rate limiting in the OpenHands SaaS server. It matters most for authentication flows (email verification, signup), where blocking a legitimate user during a Redis outage would be a poor user experience. Do not use the fail-open pattern for security-critical operations where allowing unauthenticated access would be dangerous.
The Insight (Rule of Thumb)
- Action: Wrap all Redis rate limit checks in try/except. On Redis failure, log a warning and allow the request.
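- Value: Default rate limit windows are 120 seconds per user ID and 300 seconds per IP address. Because the key is created with `NX`, a second request for the same key within the window is treated as rate limited.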
- Trade-off: During Redis outages, rate limiting is effectively disabled. This accepts temporary abuse risk in exchange for zero downtime for legitimate users.
- Fallback Strategy: When user ID is unavailable, fall back to IP-based rate limiting with a longer window (300s vs 120s) to account for shared IPs.
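The rule above might look like the following sketch. All names here (`RateLimited`, `build_key`, `check_rate_limit`, the key format, and the `BrokenRedis` test double) are assumptions for illustration rather than the actual OpenHands code; only the 120 s and 300 s windows and the fail-open behavior come from this page.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

USER_WINDOW_SECONDS = 120  # window per user ID
IP_WINDOW_SECONDS = 300    # longer window for the IP fallback


class RateLimited(Exception):
    """Hypothetical error raised when a request hits an active window."""


def build_key(user_id, ip_address):
    """Prefer the user ID; fall back to the IP address with a longer window."""
    if user_id:
        return f"rate_limit:user:{user_id}", USER_WINDOW_SECONDS
    return f"rate_limit:ip:{ip_address}", IP_WINDOW_SECONDS


async def check_rate_limit(redis, user_id=None, ip_address=None):
    """Return normally to allow the request; raise RateLimited to reject it."""
    if not redis:
        logger.warning("Redis unavailable for rate limiting, allowing request")
        return
    key, window = build_key(user_id, ip_address)
    try:
        created = await redis.set(key, 1, nx=True, ex=window)
    except Exception as e:
        # Fail open: an infrastructure error must not block the user.
        logger.warning("Error checking rate limit: %s", e, exc_info=True)
        return
    # Raised outside the try block so the broad except cannot swallow it.
    if not created:
        raise RateLimited(key)


class BrokenRedis:
    """Hypothetical test double that simulates a Redis outage."""

    async def set(self, *args, **kwargs):
        raise ConnectionError("redis is down")


# Both calls return None: a missing client and a failing client fail open.
asyncio.run(check_rate_limit(None, user_id="42"))
asyncio.run(check_rate_limit(BrokenRedis(), ip_address="203.0.113.7"))
```

One design point worth noting in this sketch: the rejection is raised after the `try` block, so a genuine rate limit hit is never mistaken for an infrastructure error and silently allowed.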
Reasoning
The fail-open strategy is justified by several observations:
- Redis outages are rare: In a well-maintained production environment, Redis downtime is uncommon and typically brief.
- Abuse during outages is unlikely: Attackers are unlikely to time their attacks precisely with Redis outages.
- User impact is disproportionate: Blocking a legitimate user during signup or email verification creates a terrible user experience that may cause them to abandon the product entirely.
- Compensating controls exist: Other security layers (reCAPTCHA, Keycloak rate limits, infrastructure WAFs) provide defense even when application-level rate limiting is unavailable.
- Atomic Redis operations prevent race conditions: Using `NX` (set if not exists) with `EX` (expiry) ensures that the check-and-set is atomic, preventing race conditions in distributed deployments where multiple server instances check the same key simultaneously.
Code evidence from `enterprise/server/utils/rate_limit_utils.py:34-39`:
redis = sio.manager.redis
if not redis:
    # If Redis is unavailable, log warning and allow request (fail open)
    logger.warning('Redis unavailable for rate limiting, allowing request')
    return
Error handling fallback from `enterprise/server/utils/rate_limit_utils.py:80-83`:
except Exception as e:
    # Log error but allow request (fail open) to avoid blocking legitimate users
    logger.warning(f'Error checking rate limit: {e}', exc_info=True)
    return
Atomic Redis rate limit check from `enterprise/server/utils/rate_limit_utils.py:51-53`:
# Try to set the key with expiration. If it already exists (nx=True fails),
# it means the rate limit is active
created = await redis.set(rate_limit_key, 1, nx=True, ex=rate_limit_seconds)