Heuristic:Wandb Weave Retry And Error Handling

Knowledge Sources	Wandb Weave Weave Team
Domains	Reliability, Error_Handling, Networking
Last Updated	2026-02-14 12:00 GMT

Overview

Configurable exponential-backoff retry strategy with jitter that distinguishes retryable errors (5xx, 429, network) from non-retryable ones (other 4xx, validation).

Description

Weave uses a tenacity-based retry mechanism with exponential backoff and jitter for all HTTP requests to the trace server. The retry system generates a unique retry ID per request for correlation across attempts, passed via the `X-Weave-Retry-Id` HTTP header. Errors are classified into retryable (server errors, network issues, rate limits) and non-retryable (client errors, validation errors, mode mismatches). Understanding this classification is critical for diagnosing tracing failures and tuning retry behavior.

Usage

Use this heuristic when traces are being dropped due to network errors, when you need to tune retry behavior for reliability vs. latency, or when debugging why specific errors are not being retried. Also relevant when encountering rate limiting (HTTP 429) or server errors (5xx).

The Insight (Rule of Thumb)

Action: Configure retry behavior via `WEAVE_RETRY_MAX_ATTEMPTS` and `WEAVE_RETRY_MAX_INTERVAL`.
Value: Defaults are 3 attempts and a 5-minute (300-second) maximum wait between attempts. Exponential backoff starts at 1 second, with jitter added.
Trade-off: More retries improve reliability but increase latency and server load during outages.
Retryable errors:
- HTTP 5xx (server errors)
- HTTP 429 (rate limiting)
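- `OSError`, `ConnectionError`, `ConnectionResetError`, `IOError` (in Python 3, `IOError` is an alias of `OSError`, and the connection errors are subclasses of it)
- Any other exception not listed as non-retryable: the classifier excerpt under Code Evidence falls through to `return True`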
Non-retryable errors (never retried):
- `pydantic.ValidationError` — Data format issues
- HTTP 4xx (except 429) — Client errors, bad requests
- `CallsCompleteModeRequired` — Triggers immediate mode switch instead
Action: For HTTP 413 (Payload Too Large), the batch processor automatically splits the batch in half and retries each half recursively.
Value: Binary splitting continues until every sub-batch succeeds or is down to a single item.
Trade-off: Adaptive splitting handles variable payload sizes without pre-computation.
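
The recursive halving described above can be sketched as follows. This is an illustration only, not Weave's batch processor: `PayloadTooLarge`, `send_with_split`, and the `send` callable are hypothetical names, and re-raising when a single item is still too large is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, List, Sequence


class PayloadTooLarge(Exception):
    """Hypothetical stand-in for an HTTP 413 response."""


def send_with_split(batch: Sequence[int], send: Callable[[Sequence[int]], None]) -> None:
    """Send a batch; on a 413-style failure, halve it and send each half recursively."""
    if not batch:
        return
    try:
        send(batch)
    except PayloadTooLarge:
        if len(batch) == 1:
            # Assumption: a single oversized item cannot be split, so surface the error.
            raise
        mid = len(batch) // 2
        send_with_split(batch[:mid], send)
        send_with_split(batch[mid:], send)


# Usage: a fake server that rejects any batch with more than two items.
sent: List[List[int]] = []


def fake_send(batch: Sequence[int]) -> None:
    if len(batch) > 2:
        raise PayloadTooLarge()
    sent.append(list(batch))


send_with_split([1, 2, 3, 4, 5], fake_send)
```

No size estimate is computed up front: the server's rejection is the only signal used, which is the trade-off noted above.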

Reasoning

The retry strategy uses `tenacity.wait_exponential_jitter(initial=1, max=retry_max_interval())`, which provides randomized exponential backoff. Jitter prevents thundering-herd problems when many clients retry simultaneously after a server recovers. The retry ID lets the server correlate all attempts belonging to the same logical request, which improves observability.
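
To make the backoff schedule concrete, the sketch below reimplements the wait formula in plain Python. It assumes tenacity's defaults for the parameters the call above does not set (`exp_base=2`, `jitter=1`) and that the delay is `initial * exp_base ** (attempt - 1)` plus uniform jitter, capped at `max`. The helpers `backoff_delay` and `delay_bounds` are hypothetical and not part of Weave or tenacity.

```python
import random
from typing import List, Tuple


def backoff_delay(attempt: int, initial: float = 1.0, exp_base: float = 2.0,
                  jitter: float = 1.0, max_interval: float = 300.0) -> float:
    """Sleep (seconds) after failed attempt number `attempt` (1-based)."""
    raw = initial * exp_base ** (attempt - 1) + random.uniform(0, jitter)
    return max(0.0, min(raw, max_interval))


def delay_bounds(max_attempts: int = 3, initial: float = 1.0, exp_base: float = 2.0,
                 jitter: float = 1.0, max_interval: float = 300.0) -> List[Tuple[float, float]]:
    """(low, high) range of each sleep; N attempts give at most N - 1 sleeps."""
    bounds = []
    for attempt in range(1, max_attempts):
        base = initial * exp_base ** (attempt - 1)
        bounds.append((min(base, max_interval), min(base + jitter, max_interval)))
    return bounds


# With the documented defaults (3 attempts), only two sleeps can occur.
default_bounds = delay_bounds()
```

Under these assumptions the default configuration sleeps for 1 to 2 seconds and then 2 to 3 seconds, so the 5-minute cap only comes into play when the attempt count is raised substantially.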

The `CallsCompleteModeRequired` exception is special-cased because it indicates a project configuration requirement (the project demands the `calls_complete` write path) rather than a transient error. Instead of retrying, the SDK immediately switches to the new mode.

Code Evidence

Retry decorator from `weave/utils/retry.py:24-54`:

def with_retry(func: Callable[..., T]) -> Callable[..., T]:
    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> T:
        retry_id = generate_id()
        retry_id_token = _retry_id.set(retry_id)

        retry = tenacity.Retrying(
            stop=tenacity.stop_after_attempt(retry_max_attempts()),
            wait=tenacity.wait_exponential_jitter(initial=1, max=retry_max_interval()),
            retry=tenacity.retry_if_exception(_is_retryable_exception),
            before_sleep=_log_retry,
            retry_error_callback=_log_failure,
            reraise=True,
        )
        try:
            return retry(func, *args, **kwargs)
        finally:
            _retry_id.reset(retry_id_token)
    return wrapper
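
The decorator stores the retry ID in a context variable so that every attempt of one logical request can report the same value. A minimal, stdlib-only sketch of that pattern follows. The names `retry_headers` and `call_with_retry_id` are hypothetical, `uuid` stands in for `generate_id()`, and a plain loop stands in for tenacity. Only the `X-Weave-Retry-Id` header name comes from the description above.

```python
import contextvars
import uuid
from typing import Callable, Dict, List

_retry_id = contextvars.ContextVar("retry_id", default=None)


def retry_headers() -> Dict[str, str]:
    """Headers for the current attempt; empty outside a retry scope."""
    rid = _retry_id.get()
    return {"X-Weave-Retry-Id": rid} if rid else {}


def call_with_retry_id(func: Callable[[], str], attempts: int = 3) -> str:
    """Run func up to `attempts` times, sharing one retry ID across attempts."""
    token = _retry_id.set(uuid.uuid4().hex)
    try:
        for attempt in range(1, attempts + 1):
            try:
                return func()
            except ConnectionError:
                if attempt == attempts:
                    raise
        raise RuntimeError("attempts must be at least 1")
    finally:
        # Restore the previous value so the ID never leaks into later requests.
        _retry_id.reset(token)


# Usage: a request that fails twice, recording the header it would send each time.
seen: List[str] = []


def flaky_request() -> str:
    seen.append(retry_headers()["X-Weave-Retry-Id"])
    if len(seen) < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = call_with_retry_id(flaky_request)
```

Resetting the token in `finally` mirrors the `_retry_id.reset(retry_id_token)` call in the excerpt above.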

Retryable vs. non-retryable classification from `weave/utils/retry.py:65-86`:

def _is_retryable_exception(e: BaseException) -> bool:
    if isinstance(e, ValidationError):
        return False
    if isinstance(e, CallsCompleteModeRequired):
        return False
    if isinstance(e, httpx.HTTPStatusError) and e.response is not None:
        code_class = e.response.status_code // 100
        if code_class == 4 and e.response.status_code != 429:
            return False
    return True
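
As a worked example of the status-code branch above, the sketch below applies the same arithmetic to a few common codes. `is_retryable_status` is a hypothetical helper that takes a bare integer instead of an `httpx.HTTPStatusError`. It covers only the HTTP branch, not the exception-type checks.

```python
def is_retryable_status(status_code: int) -> bool:
    """Mirror the HTTP branch: 4xx is final, except 429 (rate limiting)."""
    code_class = status_code // 100
    return not (code_class == 4 and status_code != 429)


# Classification of some common codes under this rule.
decisions = {code: is_retryable_status(code) for code in (400, 404, 413, 429, 500, 503)}
```

Note that 413 is classified as non-retryable by this rule, which fits with it being handled separately through batch splitting rather than by the generic retry loop.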

Retry settings from `weave/trace/settings.py:165-177`:

retry_max_interval: float = 60 * 5  # 5 min
retry_max_attempts: int = 3
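
These defaults can be overridden through the `WEAVE_RETRY_MAX_ATTEMPTS` and `WEAVE_RETRY_MAX_INTERVAL` environment variables named earlier. The snippet below is a sketch with illustrative values. It assumes the interval is given in seconds (consistent with the `60 * 5` default) and that the variables are set before Weave reads its settings.

```python
import os

# Illustrative values: favor reliability over latency for a long-running batch job.
os.environ["WEAVE_RETRY_MAX_ATTEMPTS"] = "5"   # default shown above: 3
os.environ["WEAVE_RETRY_MAX_INTERVAL"] = "60"  # seconds (assumed unit); default: 300
```

Raising the attempt count lengthens the worst-case time a failing request can block, so a lower maximum interval is a reasonable companion setting.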
