Heuristic:Togethercomputer Together python Retry Backoff Strategy
| Knowledge Sources | |
|---|---|
| Domains | Networking, Reliability |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Exponential backoff with jitter strategy for handling API rate limits and transient failures, starting at 0.5s and capped at 8s.
Description
The Together SDK implements a retry strategy that combines server-guided delays (via `Retry-After` headers) with exponential backoff and randomized jitter. When the API returns a retryable error (such as a rate limit or server error), the SDK waits for an increasing delay before each retry, up to 5 retries after the initial attempt. The jitter keeps multiple clients from retrying at the same instant (the thundering herd problem).
Usage
This heuristic applies automatically to all API calls made through the `Together()` or `AsyncTogether()` client. Understand this pattern when:
- Debugging slow API calls that involve retries
- Tuning `max_retries` or `timeout` parameters
- Building applications that need predictable latency
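When tuning `max_retries` against a latency target, it can help to estimate how much time the backoff alone may consume. The sketch below is a hypothetical helper, not part of the SDK: `worst_case_backoff` and `max_retries_within` are illustrative names, and the calculation assumes the schedule described on this page (0.5s initial delay, doubling, capped at 8s). It ignores the time spent on the requests themselves and any `Retry-After` value the server sends, both of which can add to the total.

```python
INITIAL_RETRY_DELAY = 0.5  # seconds, as documented on this page
MAX_RETRY_DELAY = 8.0      # seconds, as documented on this page


def worst_case_backoff(max_retries: int) -> float:
    """Upper bound on total sleep time across `max_retries` retries.

    Jitter only shortens delays, so the un-jittered sum is the worst case.
    """
    return sum(
        min(INITIAL_RETRY_DELAY * 2.0 ** n, MAX_RETRY_DELAY)
        for n in range(max_retries)
    )


def max_retries_within(budget_secs: float, limit: int = 5) -> int:
    """Largest retry count in 0..limit whose worst-case backoff fits the budget."""
    best = 0
    for retries in range(limit + 1):
        if worst_case_backoff(retries) <= budget_secs:
            best = retries
    return best


if __name__ == "__main__":
    for budget in (1.0, 4.0, 20.0):
        print(f"budget {budget}s -> max_retries {max_retries_within(budget)}")
```

The resulting number could then be used as the `max_retries` setting, with the request timeout budgeted separately.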
The Insight (Rule of Thumb)
- Action: The SDK automatically retries failed requests with exponential backoff + jitter.
- Value: 5 max retries; delay starts at 0.5s, doubles each retry, capped at 8s; random jitter shortens each delay by up to 25%.
- Trade-off: More retries increase reliability but add latency. A fully exhausted retry sequence spends between about 11.6s and 15.5s waiting, not counting the requests themselves.
- Server Override: If the API returns a `Retry-After` header with a value greater than 0 and at most 60s, the SDK uses that value instead of calculating its own delay.
Retry timing sequence (without server override):
- Retry 1: ~0.5s (range: 0.375s - 0.5s)
- Retry 2: ~1.0s (range: 0.75s - 1.0s)
- Retry 3: ~2.0s (range: 1.5s - 2.0s)
- Retry 4: ~4.0s (range: 3.0s - 4.0s)
- Retry 5: ~8.0s (range: 6.0s - 8.0s)
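The timing sequence above can be reproduced from the documented constants. The following is a minimal sketch rather than SDK code: `delay_bounds` and `schedule` are illustrative names, and the bounds follow from the jitter factor `1 - 0.25 * random()`, which lies between 0.75 and 1.0.

```python
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8.0      # seconds
MAX_RETRIES = 5
JITTER_FRACTION = 0.25     # jitter factor is 1 - 0.25 * random()


def delay_bounds(nb_retries: int) -> tuple[float, float]:
    """Return (lower, upper) bounds on the sleep after `nb_retries` prior retries."""
    base = min(INITIAL_RETRY_DELAY * 2.0 ** nb_retries, MAX_RETRY_DELAY)
    return base * (1 - JITTER_FRACTION), base


def schedule(max_retries: int = MAX_RETRIES) -> list[tuple[float, float]]:
    """Bounds for every retry in a fully exhausted sequence."""
    return [delay_bounds(n) for n in range(max_retries)]


if __name__ == "__main__":
    bounds = schedule()
    for i, (low, high) in enumerate(bounds, start=1):
        print(f"Retry {i}: {low:.3f}s to {high:.3f}s")
    total_low = sum(low for low, _ in bounds)
    total_high = sum(high for _, high in bounds)
    print(f"Total wait: {total_low:.3f}s to {total_high:.3f}s")
```

Summing the bounds gives a total wait between 11.625s and 15.5s for five retries, excluding request time.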
Session management:
- HTTP sessions are thread-local and recycled every 180 seconds to prevent connection staleness.
- Each session has 2 connection-level retries (urllib3 HTTPAdapter) in addition to the 5 application-level retries.
Reasoning
Exponential backoff avoids overwhelming a rate-limited API. The jitter (the factor `1 - 0.25 * random()`, which shortens each delay by up to 25%) prevents synchronized retry storms when multiple clients hit rate limits simultaneously. The server-guided delay (`Retry-After` header) takes priority because the server has the best knowledge of when capacity will be available. The 60-second cap on respecting `Retry-After` keeps a malformed or excessive server value from causing very long waits; values outside the cap fall back to the SDK's own backoff calculation.
Session recycling every 3 minutes prevents issues with stale TCP connections in long-running processes, while still benefiting from connection reuse for short bursts of requests.
Constants from `src/together/constants.py:5-10`:
TIMEOUT_SECS = 600 # 10-minute request timeout
MAX_SESSION_LIFETIME_SECS = 180 # 3-minute session lifetime
MAX_CONNECTION_RETRIES = 2 # urllib3-level retries
MAX_RETRIES = 5 # Application-level retries
INITIAL_RETRY_DELAY = 0.5 # Starting backoff delay
MAX_RETRY_DELAY = 8.0 # Maximum backoff delay
Backoff calculation from `src/together/abstract/api_requestor.py:152-170`:
def _calculate_retry_timeout(
    self, remaining_retries, response_headers=None,
) -> float:
    retry_after = self._parse_retry_after_header(response_headers)
    if retry_after is not None and 0 < retry_after <= 60:
        return retry_after
    nb_retries = self.retries - remaining_retries
    sleep_seconds = min(INITIAL_RETRY_DELAY * pow(2.0, nb_retries), MAX_RETRY_DELAY)
    jitter = 1 - 0.25 * random()
    timeout = sleep_seconds * jitter
    return timeout if timeout >= 0 else 0
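The excerpt above calls `_parse_retry_after_header` without showing it. As an illustration only, the sketch below shows one way such a header could be parsed and then filtered by the same `0 < value <= 60` check. The function names `parse_retry_after` and `server_delay` are hypothetical, and support for the HTTP-date form is an assumption based on the HTTP specification, not a statement about what the SDK does.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

MAX_HONORED_RETRY_AFTER = 60.0  # seconds; mirrors the `<= 60` check in the excerpt


def parse_retry_after(
    headers: Optional[Mapping[str, str]],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Hypothetical parser: Retry-After in seconds, or None if absent or unparseable."""
    if not headers:
        return None
    # Header names are case-insensitive, so compare in lowercase.
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return None
    try:
        return float(value)  # delta-seconds form, e.g. "Retry-After: 3"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return (when - current).total_seconds()


def server_delay(headers: Optional[Mapping[str, str]]) -> Optional[float]:
    """Return the server-requested delay if it passes the cap, else None."""
    retry_after = parse_retry_after(headers)
    if retry_after is not None and 0 < retry_after <= MAX_HONORED_RETRY_AFTER:
        return retry_after
    return None
```

A caller would use the returned value when it is not `None` and otherwise fall back to the exponential backoff calculation.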
Session recycling from `src/together/abstract/api_requestor.py:478-487`:
if not hasattr(_thread_context, "session"):
    _thread_context.session = _make_session(MAX_CONNECTION_RETRIES)
    _thread_context.session_create_time = time.time()
elif (
    time.time() - getattr(_thread_context, "session_create_time", 0)
    >= MAX_SESSION_LIFETIME_SECS
):
    _thread_context.session.close()
    _thread_context.session = _make_session(MAX_CONNECTION_RETRIES)
    _thread_context.session_create_time = time.time()