Heuristic: TruEra TruLens Rate Limiting and Retry Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Reliability |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
A two-layer request-management strategy that paces API requests at 60 requests per minute (RPM) and retries failures up to 3 times with exponential backoff (2 s, 4 s, 8 s delays), while never retrying authentication, authorization, expiration, or quota errors.
Description
TruLens implements a two-layer request management strategy for all LLM API endpoints. The first layer is pacing: a token-bucket rate limiter that constrains outgoing requests to a configurable requests-per-minute (RPM) threshold, defaulting to 60 RPM. The second layer is retry with exponential backoff: when a request fails, it is retried up to 3 times with exponentially increasing delays (2s, 4s, 8s). Critically, errors matching authentication, authorization, expiration, or quota patterns are not retried since they indicate permanent failures.
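The two layers can be sketched together in plain Python. The `PacedCaller` class below is an illustrative reimplementation, not TruLens's actual class; its name and the injectable `sleep`/`clock` hooks are assumptions made for demonstration and testability, while the defaults (60 RPM, 3 retries, 2 s initial delay, the non-retry patterns) mirror the values described above.

```python
import re
import time

# Error-message patterns treated as permanent failures (mirrors the doc's list).
NO_RETRY = re.compile("(authentication|unauthorized|expired|quota)", re.IGNORECASE)


class PacedCaller:
    """Sketch of a two-layer strategy: pace calls to `rpm` requests per
    minute, and retry transient failures with exponential backoff."""

    def __init__(self, rpm=60, retries=3, initial_delay=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.interval = 60.0 / rpm      # average spacing between requests
        self.retries = retries
        self.initial_delay = initial_delay
        self._sleep = sleep             # injectable for testing
        self._clock = clock
        self._last = None

    def _pace(self):
        # Layer 1: wait until at least `interval` seconds since the last call.
        now = self._clock()
        if self._last is not None:
            wait = self.interval - (now - self._last)
            if wait > 0:
                self._sleep(wait)
        self._last = self._clock()

    def call(self, func, *args, **kwargs):
        # Layer 2: retry with exponentially growing delay, but abort
        # immediately on permanent (non-retryable) errors.
        delay = self.initial_delay
        errors = []
        for attempt in range(self.retries + 1):
            self._pace()
            try:
                return func(*args, **kwargs)
            except Exception as e:
                errors.append(e)
                if NO_RETRY.search(str(e)):
                    break  # permanent failure: don't retry
                if attempt < self.retries:
                    self._sleep(delay)
                    delay *= 2
        raise errors[-1]
```

With the defaults, a request that fails transiently three times and then succeeds incurs 2 + 4 + 8 = 14 seconds of backoff on top of the pacing interval.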
Usage
Apply this heuristic when configuring feedback endpoint performance. If you are hitting rate limits from your LLM provider, reduce the `rpm` parameter on the endpoint. If you are experiencing transient failures, the default retry strategy handles them automatically. For high-throughput evaluation scenarios, increase `rpm` if your provider plan allows it.
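To reason about what an `rpm` setting means for evaluation throughput, a rough upper-bound calculation can help. `max_wall_time` below is a hypothetical helper, not part of TruLens; it combines the pacing interval with the worst-case backoff per request and ignores the time the requests themselves take:

```python
def max_wall_time(n_requests: int, rpm: float = 60, retries: int = 3,
                  initial_delay: float = 2.0) -> float:
    """Rough upper bound (seconds) on added wait for `n_requests`,
    assuming the worst case where every request exhausts all retries."""
    pacing = n_requests * 60.0 / rpm  # time spent pacing at `rpm`
    backoff_per_request = sum(
        initial_delay * 2**i for i in range(retries)
    )  # 2 + 4 + 8 = 14 s
    return pacing + n_requests * backoff_per_request
```

For example, at the defaults a 100-request evaluation spends about 100 s on pacing alone, but up to 1500 s if every request exhausts its retries; this is why reducing `rpm` (fewer 429s, fewer retries) can sometimes be faster end to end.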
The Insight (Rule of Thumb)
- Action: Configure `rpm` parameter when initializing an endpoint or provider. Default is 60 RPM.
- Value: `rpm=60` (1 request/second average), `retries=3` (4 total attempts), initial delay `2.0s` doubling each retry.
- Trade-off: Lower RPM reduces risk of rate limit errors but slows evaluation throughput. Higher RPM risks 429 errors from the provider.
- Critical detail: Errors containing "authentication", "unauthorized", "expired", or "quota" are never retried — they abort immediately. This prevents wasting retry attempts on permanent failures.
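The values above translate directly into a few derived quantities; this quick sanity check in plain Python uses only the numbers stated in the list:

```python
rpm, retries, initial_delay = 60, 3, 2.0

interval = 60.0 / rpm        # 1.0 s between requests on average
total_attempts = retries + 1  # 4: one initial try plus 3 retries
delays = [initial_delay * 2**i for i in range(retries)]  # [2.0, 4.0, 8.0]
```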
Reasoning
LLM API providers (OpenAI, Bedrock, Google) enforce rate limits that vary by plan tier. The 60 RPM default is a conservative baseline that works within most free/standard tier limits. Exponential backoff is a standard reliability pattern that prevents thundering herd problems during transient outages. The non-retry regex filter prevents wasting time and money retrying requests that will always fail (e.g., expired API keys, exhausted quotas).
Supporting arithmetic: the retry delay sequence (2s → 4s → 8s) totals 14 seconds of maximum wait time, which fits within typical provider rate-limit reset windows.
Code Evidence
Default RPM and retry config from `src/core/trulens/core/feedback/endpoint.py:48,194-198`:
```python
DEFAULT_RPM = 60
"""Default requests per minute for endpoints."""

rpm: float = DEFAULT_RPM
"""Requests per minute."""

retries: int = 3
"""Retries (if performing requests using this class)."""
```
Exponential backoff implementation from `src/core/trulens/core/feedback/endpoint.py:306-343`:
```python
def run_in_pace(self, func: Callable[[A], B], *args, **kwargs) -> B:
    """Run the given `func` on the given `args` and `kwargs` at pace with the
    endpoint-specified rpm. Failures will be retried `self.retries` times."""

    retries = self.retries + 1
    attempts = 0
    retry_delay = 2.0
    errors = []

    while retries > 0:
        try:
            self.pace_me()
            attempts += 1
            ret = func(*args, **kwargs)
            return ret
        except Exception as e:
            retries -= 1
            logger.error(
                "%s request failed %s=%s. Retries remaining=%s.",
                self.name, type(e), e, retries,
            )
            errors.append(e)
            if not self._can_retry(e):
                break
            if retries > 0:
                sleep(retry_delay)
                retry_delay *= 2
```
Non-retryable pattern from `src/core/trulens/core/feedback/endpoint.py:56-63`:
```python
_RE_NO_RETRY = re.compile(
    "(" + "|".join(["authentication", "unauthorized", "expired", "quota"]) + ")",
    re.IGNORECASE,
)
```
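A quick check of how this pattern classifies error messages. The standalone `can_retry` function below is a sketch of the gate the source applies via `_can_retry`, reusing the same regex; the specific error strings are made-up examples:

```python
import re

_RE_NO_RETRY = re.compile(
    "(" + "|".join(["authentication", "unauthorized", "expired", "quota"]) + ")",
    re.IGNORECASE,
)


def can_retry(error: Exception) -> bool:
    """An error is retryable only if its message matches none of the
    permanent-failure patterns."""
    return _RE_NO_RETRY.search(str(error)) is None


# Transient errors are retried; permanent ones abort immediately.
assert can_retry(RuntimeError("Connection reset by peer"))
assert can_retry(RuntimeError("429 Too Many Requests"))
assert not can_retry(RuntimeError("Quota exceeded for this billing period"))
assert not can_retry(RuntimeError("API key expired"))
```

Note that the match is case-insensitive and runs against the whole stringified exception, so "Quota exceeded" and "quota limit reached" are both treated as permanent.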