Heuristic:Togethercomputer Together python Retry Backoff Strategy

From Leeroopedia
Revision as of 10:40, 16 February 2026 by Admin (Auto-imported from heuristics/Togethercomputer_Together_python_Retry_Backoff_Strategy.md)
Knowledge Sources
Domains Networking, Reliability
Last Updated 2026-02-15 16:00 GMT

Overview

An exponential backoff strategy with jitter for handling API rate limits and transient failures. The delay starts at 0.5s and is capped at 8s per retry.

Description

The Together SDK implements a retry strategy that combines server-guided delays (via `Retry-After` headers) with exponential backoff and randomized jitter. When the API returns a retryable error (rate limit, server error), the SDK waits for an increasing delay before each retry, up to 5 retries. The jitter prevents multiple clients from retrying simultaneously (the thundering herd problem).

Usage

This heuristic applies automatically to all API calls made through the `Together()` or `AsyncTogether()` client. It is worth understanding when:

  • Debugging slow API calls that involve retries
  • Tuning `max_retries` or `timeout` parameters
  • Building applications that need predictable latency

The Insight (Rule of Thumb)

  • Action: The SDK automatically retries failed requests with exponential backoff + jitter.
  • Value: 5 max retries; the base delay starts at 0.5s, doubles on each retry, and is capped at 8s; random jitter then shortens each delay by up to 25%.
  • Trade-off: More retries increase reliability but add latency. A fully exhausted retry sequence adds up to 15.5s of waiting (about 11.6s at minimum with jitter), not counting the time spent on the requests themselves.
  • Server Override: If the API returns a `Retry-After` header with a value greater than 0 and at most 60s, the SDK uses that value instead of calculating its own delay.

Retry timing sequence (without server override):

  • Retry 1: up to 0.5s (range: 0.375s to 0.5s)
  • Retry 2: up to 1.0s (range: 0.75s to 1.0s)
  • Retry 3: up to 2.0s (range: 1.5s to 2.0s)
  • Retry 4: up to 4.0s (range: 3.0s to 4.0s)
  • Retry 5: up to 8.0s (range: 6.0s to 8.0s)
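
The ranges above follow directly from the quoted constants. The sketch below recomputes them and sums the total wait for an exhausted sequence. It is an illustration, not SDK code: the helper names `backoff_bounds` and `total_wait_bounds` are hypothetical, and it assumes no `Retry-After` override applies.

```python
from typing import Tuple

# Values quoted on this page; the helpers below are illustrative only.
INITIAL_RETRY_DELAY = 0.5   # seconds
MAX_RETRY_DELAY = 8.0       # seconds
MAX_RETRIES = 5
JITTER_FRACTION = 0.25      # each delay is scaled by a factor in (0.75, 1.0]


def backoff_bounds(retry_index: int) -> Tuple[float, float]:
    """Return (lower, upper) delay bounds for a 0-based retry index.

    The lower bound is approached but never reached, because random()
    returns values in [0, 1).
    """
    base = min(INITIAL_RETRY_DELAY * 2.0 ** retry_index, MAX_RETRY_DELAY)
    return base * (1 - JITTER_FRACTION), base


def total_wait_bounds(max_retries: int = MAX_RETRIES) -> Tuple[float, float]:
    """Sum the per-retry bounds over a fully exhausted retry sequence."""
    bounds = [backoff_bounds(i) for i in range(max_retries)]
    return sum(lo for lo, _ in bounds), sum(hi for _, hi in bounds)


if __name__ == "__main__":
    for i in range(MAX_RETRIES):
        lo, hi = backoff_bounds(i)
        print(f"Retry {i + 1}: {lo:.3f}s to {hi:.3f}s")
    lo, hi = total_wait_bounds()
    print(f"Total wait: {lo:.3f}s to {hi:.3f}s")
```

With the default constants the total wait falls between 11.625s and 15.5s. This covers only the sleeps between attempts; time spent on each request is extra.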

Session management:

  • HTTP sessions are thread-local and are replaced once they are 180 seconds old (the age is checked when a session is requested), to prevent connection staleness.
  • Each session has 2 connection-level retries (urllib3 HTTPAdapter) in addition to the 5 application-level retries.

Reasoning

Exponential backoff avoids overwhelming a rate-limited API. The jitter (`1 - 0.25 * random()`, which scales each delay by a factor between 0.75 and 1.0) prevents synchronized retry storms when multiple clients hit rate limits simultaneously. The server-guided delay (`Retry-After` header) takes priority because the server has the best knowledge of when capacity will be available. The 60-second cap on respecting `Retry-After` keeps a malformed or unusually large header value from stalling the client for long periods; values outside the accepted range fall back to the computed backoff.
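
To make the thundering-herd argument concrete, the following sketch simulates many clients that all fail at the same instant and compares their first retry delay with and without jitter. It is a toy model under stated assumptions: the function names are hypothetical, and it reuses only the jitter formula and constants quoted on this page.

```python
import random
from typing import List

INITIAL_RETRY_DELAY = 0.5   # seconds, as quoted on this page
MAX_RETRY_DELAY = 8.0       # seconds, as quoted on this page


def retry_delay(retry_index: int, rng: random.Random, jitter: bool = True) -> float:
    """Backoff delay for a 0-based retry index, optionally jittered."""
    base = min(INITIAL_RETRY_DELAY * 2.0 ** retry_index, MAX_RETRY_DELAY)
    factor = 1 - 0.25 * rng.random() if jitter else 1.0
    return base * factor


def first_retry_delays(n_clients: int, jitter: bool, seed: int = 0) -> List[float]:
    """First-retry delays for n_clients that all failed at t=0."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return [retry_delay(0, rng, jitter) for _ in range(n_clients)]


if __name__ == "__main__":
    for jitter in (False, True):
        delays = first_retry_delays(1000, jitter)
        print(
            f"jitter={jitter}: {len(set(delays))} distinct retry instants, "
            f"spread {max(delays) - min(delays):.3f}s"
        )
```

Without jitter every simulated client retries at exactly the same moment. With jitter the retries are spread over a window of up to a quarter of the base delay.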

Session recycling every 3 minutes prevents issues with stale TCP connections in long-running processes, while still benefiting from connection reuse for short bursts of requests.

Constants from `src/together/constants.py:5-10`:

TIMEOUT_SECS = 600          # 10-minute request timeout
MAX_SESSION_LIFETIME_SECS = 180  # 3-minute session lifetime
MAX_CONNECTION_RETRIES = 2   # urllib3-level retries
MAX_RETRIES = 5              # Application-level retries
INITIAL_RETRY_DELAY = 0.5   # Starting backoff delay
MAX_RETRY_DELAY = 8.0       # Maximum backoff delay

Backoff calculation from `src/together/abstract/api_requestor.py:152-170`:

def _calculate_retry_timeout(
    self, remaining_retries, response_headers=None,
) -> float:
    retry_after = self._parse_retry_after_header(response_headers)
    if retry_after is not None and 0 < retry_after <= 60:
        return retry_after

    nb_retries = self.retries - remaining_retries
    sleep_seconds = min(INITIAL_RETRY_DELAY * pow(2.0, nb_retries), MAX_RETRY_DELAY)
    jitter = 1 - 0.25 * random()
    timeout = sleep_seconds * jitter
    return timeout if timeout >= 0 else 0
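
The excerpt above calls `_parse_retry_after_header`, whose body is not shown on this page. As a rough illustration of what such a parser has to handle, the sketch below accepts both forms that HTTP allows for `Retry-After`: a number of seconds or an HTTP date. The function `parse_retry_after` is hypothetical and is not the SDK's implementation. The fractional-seconds leniency and the `now` parameter are assumptions added for illustration and testability.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_retry_after(
    headers: Optional[Mapping[str, str]],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent/unparseable."""
    if not headers:
        return None
    # Header names are case-insensitive, so compare in lower case.
    value = next(
        (v for k, v in headers.items() if k.lower() == "retry-after"), None
    )
    if value is None:
        return None
    value = value.strip()

    # Form 1: delay in seconds (fractions accepted here as a leniency).
    try:
        return float(value)
    except ValueError:
        pass

    # Form 2: an HTTP date, converted to seconds from "now".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return (when - current).total_seconds()
```

A caller would still apply the `0 < retry_after <= 60` check from the excerpt, so negative, zero, or very large results fall through to the computed backoff.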

Session recycling from `src/together/abstract/api_requestor.py:478-487`:

if not hasattr(_thread_context, "session"):
    _thread_context.session = _make_session(MAX_CONNECTION_RETRIES)
    _thread_context.session_create_time = time.time()
elif (
    time.time() - getattr(_thread_context, "session_create_time", 0)
    >= MAX_SESSION_LIFETIME_SECS
):
    _thread_context.session.close()
    _thread_context.session = _make_session(MAX_CONNECTION_RETRIES)
    _thread_context.session_create_time = time.time()
