Heuristic:OpenHands OpenHands Fail Open Rate Limiting

Knowledge Sources	OpenHands Enterprise rate limiting implementation
Domains	SaaS_Infrastructure, Distributed_Systems
Last Updated	2026-02-11 21:00 GMT

Overview

Rate limiting implementation that fails open (allows requests) when Redis is unavailable, prioritizing availability for legitimate users over strict abuse prevention.

Description

OpenHands implements a fail-open rate limiting strategy using Redis as the backing store. When Redis is unavailable or an error occurs during the rate limit check, the system allows the request to proceed rather than blocking it. This is a deliberate architectural decision: blocking legitimate users due to an infrastructure failure (Redis outage) is considered worse than temporarily allowing potential abuse. The rate limiter uses Redis atomic set-with-expiry (`SET key value NX EX ttl`) for both checking and setting rate limits in a single operation, providing race-condition-free enforcement across distributed server instances.
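
The flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenHands implementation: `InMemoryRedis`, `RateLimitExceeded`, and `check_rate_limit` are hypothetical names, and the in-memory class stands in for an async Redis client so the example runs without a server. Its `set(key, value, nx=True, ex=ttl)` call is modeled on the redis-py signature.

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


class RateLimitExceeded(Exception):
    """Hypothetical error raised when a key is still inside its window."""


class InMemoryRedis:
    """Hypothetical stand-in for an async Redis client (SET with NX/EX only)."""

    def __init__(self):
        self._expiry = {}  # key -> monotonic deadline

    async def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        deadline = self._expiry.get(key)
        if nx and deadline is not None and deadline > now:
            return None  # key exists and has not expired: NX refuses the write
        self._expiry[key] = now + ex if ex is not None else float("inf")
        return True


async def check_rate_limit(redis, key, window_seconds):
    """Raise RateLimitExceeded if the key is limited; fail open on Redis problems."""
    if not redis:
        # No client at all: allow the request (fail open).
        logger.warning("Redis unavailable for rate limiting, allowing request")
        return
    try:
        created = await redis.set(key, 1, nx=True, ex=window_seconds)
    except Exception as e:
        # Infrastructure error: log it and allow the request (fail open).
        logger.warning("Error checking rate limit: %s", e, exc_info=True)
        return
    if not created:
        # The key already existed, so the window is still active.
        raise RateLimitExceeded(key)
```

Keeping the `raise` outside the `try` block means only Redis errors are swallowed; a genuine rate limit hit still reaches the caller.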

Usage

Apply this pattern when implementing any rate limiting in the OpenHands SaaS server. It is particularly important for authentication flows (email verification, signup) where blocking a legitimate user during a Redis outage would be a poor user experience. The fail-open pattern should not be used for security-critical operations where allowing unauthenticated access would be dangerous.

The Insight (Rule of Thumb)

Action: Wrap all Redis rate limit checks in try/except. On Redis failure, log a warning and allow the request.
Value: Default rate limit windows are 120 seconds per user ID and 300 seconds per IP address. Because the check is a single `SET ... NX EX`, each key admits one request per window.
Trade-off: During Redis outages, rate limiting is effectively disabled. This accepts temporary abuse risk in exchange for zero downtime for legitimate users.
Fallback Strategy: When user ID is unavailable, fall back to IP-based rate limiting with a longer window (300s vs 120s) to account for shared IPs.
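
The fallback from user ID to IP address can be sketched as a small key-selection helper. The function name, the `action` parameter, and the key format are hypothetical and chosen for illustration only; the two window values are the defaults stated above. Returning `None` when neither identifier is known is an assumption of this sketch.

```python
from typing import Optional, Tuple

USER_WINDOW_SECONDS = 120  # default window per user ID
IP_WINDOW_SECONDS = 300    # default window per IP address (longer: IPs may be shared)


def select_rate_limit_key(
    action: str, user_id: Optional[str], ip_address: Optional[str]
) -> Optional[Tuple[str, int]]:
    """Return (redis_key, window_seconds), preferring the user ID over the IP.

    Returns None when neither identifier is known (an assumption of this
    sketch; the caller would then skip the check).
    """
    if user_id:
        return f"rate_limit:{action}:user:{user_id}", USER_WINDOW_SECONDS
    if ip_address:
        return f"rate_limit:{action}:ip:{ip_address}", IP_WINDOW_SECONDS
    return None
```

For example, `select_rate_limit_key("email_verify", None, "203.0.113.7")` yields an IP-scoped key with the 300-second window.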

Reasoning

The fail-open strategy is justified by several observations:

Redis outages are rare: In a well-maintained production environment, Redis downtime is uncommon and typically brief.
Abuse during outages is unlikely: An attacker would have to time an attack to coincide with a Redis outage.
User impact is disproportionate: Blocking a legitimate user during signup or email verification creates a terrible user experience that may cause them to abandon the product entirely.
Compensating controls exist: Other security layers (reCAPTCHA, Keycloak rate limits, infrastructure WAFs) provide defense even when application-level rate limiting is unavailable.
Atomic Redis operations prevent race conditions: Using `NX` (set if not exists) with `EX` (expiry) ensures that the check-and-set is atomic, preventing race conditions in distributed deployments where multiple server instances check the same key simultaneously.
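
The last point can be illustrated with a small simulation that contrasts a separate check-then-set with a single `nx=True` write. `FakeRedis`, `racy_allow`, and `atomic_allow` are hypothetical names; the fake client omits expiry and uses `asyncio.sleep(0)` to imitate a network round trip, so this is only a model of the behavior and not a test against a real Redis server.

```python
import asyncio


class FakeRedis:
    """Hypothetical in-memory stand-in; expiry is omitted to focus on atomicity."""

    def __init__(self):
        self._data = {}

    async def get(self, key):
        await asyncio.sleep(0)  # simulate a network round trip
        return self._data.get(key)

    async def set(self, key, value, nx=False):
        await asyncio.sleep(0)  # simulate a network round trip
        # The existence check and the write happen with no await in between,
        # which models Redis executing SET ... NX as a single command.
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True


async def racy_allow(redis, key):
    """Non-atomic: separate GET and SET leave a gap between check and write."""
    if await redis.get(key) is not None:
        return False
    await redis.set(key, 1)
    return True


async def atomic_allow(redis, key):
    """Atomic: one SET with nx=True both checks and claims the key."""
    return bool(await redis.set(key, 1, nx=True))


async def count_allowed(allow, n=5):
    """Run n concurrent callers against one key and count how many are allowed."""
    redis = FakeRedis()
    results = await asyncio.gather(*(allow(redis, "rate_limit:demo") for _ in range(n)))
    return sum(results)


racy = asyncio.run(count_allowed(racy_allow))
atomic = asyncio.run(count_allowed(atomic_allow))
print(f"racy allowed {racy} of 5, atomic allowed {atomic} of 5")
```

In this simulation the non-atomic version lets several concurrent callers through, while the atomic version admits exactly one.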

Code evidence from `enterprise/server/utils/rate_limit_utils.py:34-39`:

redis = sio.manager.redis
if not redis:
    # If Redis is unavailable, log warning and allow request (fail open)
    logger.warning('Redis unavailable for rate limiting, allowing request')
    return

Error handling fallback from `enterprise/server/utils/rate_limit_utils.py:80-83`:

except Exception as e:
    # Log error but allow request (fail open) to avoid blocking legitimate users
    logger.warning(f'Error checking rate limit: {e}', exc_info=True)
    return

Atomic Redis rate limit check from `enterprise/server/utils/rate_limit_utils.py:51-53`:

# Try to set the key with expiration. If it already exists (nx=True fails),
# it means the rate limit is active
created = await redis.set(rate_limit_key, 1, nx=True, ex=rate_limit_seconds)
