Heuristic:Treeverse LakeFS Retry Backoff Configuration

Knowledge Sources: lakectl retry client, lakectl async polling
Domains: Optimization, Debugging
Last Updated: 2026-02-08 10:00 GMT

Overview

lakeFS uses differentiated retry strategies: up to 4 attempts with backoff capped at 30 seconds for API calls, up to 75 attempts with backoff capped at 1 second for browser login, and exponential backoff for async polling.

Description

The lakeFS CLI (lakectl) and integration tests use multiple retry and backoff configurations tuned for different operational contexts. Regular API calls use conservative retries (4 attempts, 200ms-30s interval), browser login uses aggressive polling (75 attempts, 50ms-1s), and async operation status polling uses exponential backoff (1s-10s interval, 1-hour total timeout). Only specific HTTP status codes trigger retries: 429 (Too Many Requests), 500 (Internal Server Error), and 503 (Service Unavailable). Login additionally retries on 404 (Not Found).

Usage

Use this heuristic when implementing clients that interact with lakeFS APIs, debugging timeout issues, or configuring retry behavior. Understanding these patterns prevents both excessive retrying (wasting resources) and insufficient retrying (premature failures).

The Insight (Rule of Thumb)

Action: Match retry strategy to the operation type:
- API calls: 4 attempts max, 200ms to 30s exponential backoff
- Browser login: 75 attempts max, 50ms to 1s fast polling
- Async status: 1s to 10s exponential backoff, 1 hour total timeout
Value: Only retry on 429, 500, 503 (plus 404 for login). Never retry on context cancellation or deadline expiry, an invalid URL scheme, TLS certificate verification failures, or too many redirects.
Trade-off: Conservative retries for API calls mean faster failure detection, but the client may give up before a longer transient outage clears. Aggressive login polling keeps the UX responsive but generates more requests.
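When debugging a timeout, it helps to know how long each strategy can spend waiting between attempts. The sketch below estimates that from the attempt counts and interval bounds listed above. It assumes the delay doubles after every failed attempt, with no jitter and no server-supplied Retry-After. The doubling factor and the helper names `backoffDelay` and `worstCaseWait` are illustrative assumptions, not lakectl code.

```go
package main

import (
	"fmt"
	"time"
)

// Values copied from the defaults quoted on this page.
const (
	apiMaxAttempts = 4
	apiMinInterval = 200 * time.Millisecond
	apiMaxInterval = 30 * time.Second

	loginMaxAttempts = 75
	loginMinInterval = 50 * time.Millisecond
	loginMaxInterval = 1 * time.Second
)

// backoffDelay returns the wait before retry number `retry` (0-based).
// Assumption: the delay starts at minWait, doubles each time, and is
// capped at maxWait.
func backoffDelay(minWait, maxWait time.Duration, retry int) time.Duration {
	d := minWait
	for i := 0; i < retry && d < maxWait; i++ {
		d *= 2
	}
	if d > maxWait {
		return maxWait
	}
	return d
}

// worstCaseWait sums the waits between attempts when every attempt fails.
// N attempts are separated by N-1 waits; request time itself is not counted.
func worstCaseWait(attempts int, minWait, maxWait time.Duration) time.Duration {
	var total time.Duration
	for retry := 0; retry < attempts-1; retry++ {
		total += backoffDelay(minWait, maxWait, retry)
	}
	return total
}

func main() {
	fmt.Println("API calls:    ", worstCaseWait(apiMaxAttempts, apiMinInterval, apiMaxInterval))
	fmt.Println("Browser login:", worstCaseWait(loginMaxAttempts, loginMinInterval, loginMaxInterval))
}
```

Under these assumptions an API call spends at most 1.4s waiting (200ms + 400ms + 800ms) and never reaches the 30s cap, while browser login can keep polling for a little over 70 seconds. A real client that honors Retry-After headers or adds jitter will behave differently.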

Reasoning

The differentiated retry strategy reflects the operational characteristics of each use case. API calls to a running server should succeed quickly or fail definitively, hence few retries with increasing backoff. Browser login polls a possibly-not-yet-ready auth flow, requiring many fast attempts. Async operations (commits, merges) can legitimately run for minutes on large repositories, justifying the 1-hour timeout. The non-retriable error list (TLS failures, context cancellation) prevents wasting time on errors that will never self-resolve.

Code Evidence

Default retry configuration from `cmd/lakectl/cmd/root.go:182-189`:

defaultHTTP2Enabled     = true
defaultMaxAttempts      = 4
defaultMaxRetryInterval = 30 * time.Second
defaultMinRetryInterval = 200 * time.Millisecond

defaultBrowserLoginMaxAttempts      = 75
defaultBrowserLoginMaxRetryInterval = 1 * time.Second
defaultBrowserLoginMinRetryInterval = 50 * time.Millisecond

Async polling configuration from `cmd/lakectl/cmd/async.go:18-22`:

const (
    initialInterval     = 1 * time.Second
    maxInterval         = 10 * time.Second
    defaultPollInterval = 3 * time.Second
    minimumPollInterval = time.Second
    defaultPollTimeout  = time.Hour
)
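A polling loop built on these constants might look like the following minimal sketch. It is not the code in async.go: `pollUntilDone`, `simulate`, and the doubling factor are hypothetical, and the demo uses millisecond-scale values in place of the 1s, 10s, and 1-hour settings so that it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntilDone calls check until it reports completion, returns an error,
// or the overall timeout elapses. The wait between checks starts at initial
// and doubles after each check, capped at maxWait (assumed growth factor).
// It returns the number of status checks performed.
func pollUntilDone(ctx context.Context, initial, maxWait, timeout time.Duration,
	check func(context.Context) (bool, error)) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	interval := initial
	for calls := 1; ; calls++ {
		done, err := check(ctx)
		if err != nil {
			return calls, err
		}
		if done {
			return calls, nil
		}

		// Sleep for the current interval, but wake early on timeout or cancel.
		timer := time.NewTimer(interval)
		select {
		case <-ctx.Done():
			timer.Stop()
			return calls, ctx.Err()
		case <-timer.C:
		}

		interval *= 2
		if interval > maxWait {
			interval = maxWait
		}
	}
}

// simulate polls a fake operation that completes on the given status check.
// Intervals are 1ms to 10ms with a 100ms timeout, standing in for 1s to 10s
// with a 1-hour timeout.
func simulate(completesOnCall int) (int, error) {
	calls := 0
	return pollUntilDone(context.Background(),
		time.Millisecond, 10*time.Millisecond, 100*time.Millisecond,
		func(context.Context) (bool, error) {
			calls++
			return calls >= completesOnCall, nil
		})
}

func main() {
	calls, err := simulate(3)
	fmt.Println("finished after", calls, "checks, err:", err)

	_, err = simulate(1000) // never completes within the timeout
	fmt.Println("slow operation err:", err)
}
```

Bounding the loop with a context deadline means a caller that cancels the parent context stops the polling immediately, instead of waiting out the current interval.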

Non-retriable error detection from `cmd/lakectl/cmd/retry_client.go:44-88`:

// Non-retriable errors:
// - Context canceled or deadline exceeded
// - Invalid HTTP scheme/protocol
// - TLS certificate verification failures (x509.UnknownAuthorityError)
// - Too many redirects

// Retriable HTTP statuses: 429, 500, 503
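The classification summarized in these comments can be sketched as a single decision function. This is an illustration under stated assumptions, not the contents of retry_client.go: `shouldRetry` and the example errors are hypothetical, the contents of `lakectlDefaultRetryStatuses` are taken from the statuses listed on this page, matching redirect and scheme failures by message text is an assumed approach, and treating all other transport errors as retriable is also an assumption.

```go
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"regexp"
	"slices"
	"strings"
)

// Retriable statuses as listed on this page; login additionally retries 404.
var lakectlDefaultRetryStatuses = []int{
	http.StatusTooManyRequests,     // 429
	http.StatusInternalServerError, // 500
	http.StatusServiceUnavailable,  // 503
}

var loginRetryStatuses = slices.Concat(lakectlDefaultRetryStatuses,
	[]int{http.StatusNotFound},
)

// net/http reports redirect loops as "stopped after N redirects".
var redirectsRe = regexp.MustCompile(`stopped after \d+ redirects`)

// shouldRetry reports whether a request should be attempted again.
// With a nil err, the decision depends only on the response status.
func shouldRetry(statuses []int, statusCode int, err error) bool {
	if err == nil {
		return slices.Contains(statuses, statusCode)
	}
	// The caller gave up or ran out of time: retrying cannot help.
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	// An untrusted certificate stays untrusted on the next attempt.
	var unknownAuthority x509.UnknownAuthorityError
	if errors.As(err, &unknownAuthority) {
		return false
	}
	// Assumed approach: detect scheme and redirect failures by message text.
	msg := err.Error()
	if strings.Contains(msg, "unsupported protocol scheme") || redirectsRe.MatchString(msg) {
		return false
	}
	// Assumption: any other transport error (for example a reset) is transient.
	return true
}

// Example errors used by main.
var (
	errCanceled  = fmt.Errorf("request failed: %w", context.Canceled)
	errUntrusted = fmt.Errorf("request failed: %w", x509.UnknownAuthorityError{})
	errRedirects = errors.New("stopped after 10 redirects")
	errReset     = errors.New("read: connection reset by peer")
)

func main() {
	fmt.Println("API 503:      ", shouldRetry(lakectlDefaultRetryStatuses, 503, nil))
	fmt.Println("API 404:      ", shouldRetry(lakectlDefaultRetryStatuses, 404, nil))
	fmt.Println("login 404:    ", shouldRetry(loginRetryStatuses, 404, nil))
	fmt.Println("canceled:     ", shouldRetry(lakectlDefaultRetryStatuses, 0, errCanceled))
	fmt.Println("untrusted CA: ", shouldRetry(lakectlDefaultRetryStatuses, 0, errUntrusted))
	fmt.Println("redirect loop:", shouldRetry(lakectlDefaultRetryStatuses, 0, errRedirects))
	fmt.Println("conn reset:   ", shouldRetry(lakectlDefaultRetryStatuses, 0, errReset))
}
```

Passing the status list as a parameter lets one function serve both regular API calls and browser login, where the only difference is that 404 counts as "not ready yet" rather than a definitive failure.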

Login-specific retry statuses from `cmd/lakectl/cmd/login.go:38-40`:

var loginRetryStatuses = slices.Concat(lakectlDefaultRetryStatuses,
    []int{http.StatusNotFound},
)
