Heuristic:Treeverse LakeFS Retry Backoff Configuration
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Debugging |
| Last Updated | 2026-02-08 10:00 GMT |
Overview
lakeFS uses differentiated retry strategies: up to 4 attempts with a 30-second maximum interval for API calls, up to 75 attempts with a 1-second maximum interval for browser login, and exponential backoff for async polling.
Description
The lakeFS CLI (lakectl) and integration tests use multiple retry and backoff configurations tuned for different operational contexts. Regular API calls use conservative retries (4 attempts, 200ms-30s interval), browser login uses aggressive polling (75 attempts, 50ms-1s), and async operation status polling uses exponential backoff (1s-10s interval, 1-hour total timeout). Only specific HTTP status codes trigger retries: 429 (Too Many Requests), 500 (Internal Server Error), and 503 (Service Unavailable). Login additionally retries on 404 (Not Found).
Usage
Use this heuristic when implementing clients that interact with lakeFS APIs, debugging timeout issues, or configuring retry behavior. Understanding these patterns prevents both excessive retrying (wasting resources) and insufficient retrying (premature failures).
The Insight (Rule of Thumb)
- Action: Match retry strategy to the operation type:
  - API calls: 4 attempts max, 200ms to 30s exponential backoff
  - Browser login: 75 attempts max, 50ms to 1s fast polling
  - Async status: 1s to 10s exponential backoff, 1 hour total timeout
- Value: Only retry on 429, 500, 503 (plus 404 for login). Never retry on TLS errors, context cancellation, or too many redirects.
- Trade-off: A low attempt cap on API calls surfaces failures quickly, but it can give up on a transient outage that outlasts the retry window. Aggressive login polling keeps the login flow responsive at the cost of more requests.
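To make the difference between the two profiles concrete, the following is a minimal sketch of how the wait between attempts could grow. The `retryProfile` type and its methods are hypothetical, and the doubling rule is an assumption: the client library lakectl actually uses may apply jitter or a different growth curve. Only the numeric defaults come from the configuration quoted on this page.

```go
package main

import (
	"fmt"
	"time"
)

// retryProfile is a hypothetical type bundling the three retry knobs.
type retryProfile struct {
	maxAttempts int
	minInterval time.Duration
	maxInterval time.Duration
}

// Numeric values copied from the defaults quoted on this page.
var (
	apiProfile   = retryProfile{maxAttempts: 4, minInterval: 200 * time.Millisecond, maxInterval: 30 * time.Second}
	loginProfile = retryProfile{maxAttempts: 75, minInterval: 50 * time.Millisecond, maxInterval: 1 * time.Second}
)

// wait returns the pause before retry number `retry` (0-based).
// ASSUMPTION: the interval doubles each time and is capped at maxInterval.
func (p retryProfile) wait(retry int) time.Duration {
	d := p.minInterval
	for i := 0; i < retry; i++ {
		d *= 2
		if d >= p.maxInterval {
			return p.maxInterval // returning early also avoids overflow
		}
	}
	if d > p.maxInterval {
		return p.maxInterval
	}
	return d
}

// totalWait sums the pauses between attempts: maxAttempts attempts
// leave room for maxAttempts-1 waits. Request time is not included.
func (p retryProfile) totalWait() time.Duration {
	var total time.Duration
	for r := 0; r < p.maxAttempts-1; r++ {
		total += p.wait(r)
	}
	return total
}

func main() {
	fmt.Println("API worst-case wait:  ", apiProfile.totalWait())
	fmt.Println("login worst-case wait:", loginProfile.totalWait())
}
```

Under the doubling assumption, the API profile waits 200ms, 400ms, and 800ms (1.4s in total), so four attempts never approach the 30-second cap. The login profile reaches its 1-second cap after five waits and then polls once per second, for roughly 70 seconds of waiting across 75 attempts.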
Reasoning
The differentiated retry strategy reflects the operational characteristics of each use case. API calls to a running server should succeed quickly or fail definitively, hence few retries with increasing backoff. Browser login polls a possibly-not-yet-ready auth flow, requiring many fast attempts. Async operations (commits, merges) can legitimately run for minutes on large repositories, justifying the 1-hour timeout. The non-retriable error list (TLS failures, context cancellation) prevents wasting time on errors that will never self-resolve.
Code Evidence
Default retry configuration from `cmd/lakectl/cmd/root.go:182-189`:
defaultHTTP2Enabled = true
defaultMaxAttempts = 4
defaultMaxRetryInterval = 30 * time.Second
defaultMinRetryInterval = 200 * time.Millisecond
defaultBrowserLoginMaxAttempts = 75
defaultBrowserLoginMaxRetryInterval = 1 * time.Second
defaultBrowserLoginMinRetryInterval = 50 * time.Millisecond
Async polling configuration from `cmd/lakectl/cmd/async.go:18-22`:
const (
	initialInterval     = 1 * time.Second
	maxInterval         = 10 * time.Second
	defaultPollInterval = 3 * time.Second
	minimumPollInterval = time.Second
	defaultPollTimeout  = time.Hour
)
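As an illustration of how these constants could drive a status-polling loop, here is a minimal sketch. It is not the lakectl implementation: `pollUntilDone`, `errPollTimeout`, and `noSleep` are hypothetical names, the doubling rule is an assumption, and the loop budgets only the time spent sleeping rather than wall-clock time. It uses `initialInterval`, `maxInterval`, and `defaultPollTimeout` and leaves the other two constants aside, since this page does not say how they are applied.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Values copied from the async polling constants quoted on this page.
const (
	initialInterval    = 1 * time.Second
	maxInterval        = 10 * time.Second
	defaultPollTimeout = time.Hour
)

// errPollTimeout is a hypothetical sentinel error for this sketch.
var errPollTimeout = errors.New("polling timed out")

// noSleep is a stand-in for time.Sleep so the loop can run instantly.
var noSleep = func(time.Duration) {}

// pollUntilDone calls check until it reports completion or returns an error.
// Between calls it sleeps for an interval that starts at initialInterval,
// doubles after each poll (ASSUMPTION), and is capped at maxInterval.
// It gives up once the next sleep would exceed the timeout budget.
// The returned int is the number of status checks performed.
func pollUntilDone(timeout time.Duration, check func() (bool, error), sleep func(time.Duration)) (int, error) {
	interval := initialInterval
	var waited time.Duration
	polls := 0
	for {
		polls++
		done, err := check()
		if err != nil || done {
			return polls, err
		}
		if waited+interval > timeout {
			return polls, errPollTimeout
		}
		sleep(interval)
		waited += interval
		interval *= 2
		if interval > maxInterval {
			interval = maxInterval
		}
	}
}

func main() {
	remaining := 5 // pretend the operation finishes on the fifth status check
	check := func() (bool, error) {
		remaining--
		return remaining == 0, nil
	}
	logSleep := func(d time.Duration) { fmt.Println("would sleep", d) }
	polls, err := pollUntilDone(defaultPollTimeout, check, logSleep)
	fmt.Println("polls:", polls, "err:", err)
}
```

In real use, `time.Sleep` would be passed as the `sleep` argument. Injecting it keeps the sketch fast to run and makes the schedule easy to inspect.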
Non-retriable error detection from `cmd/lakectl/cmd/retry_client.go:44-88`:
// Non-retriable errors:
// - Context canceled or deadline exceeded
// - Invalid HTTP scheme/protocol
// - TLS certificate verification failures (x509.UnknownAuthorityError)
// - Too many redirects
// Retriable HTTP statuses: 429, 500, 503
Login-specific retry statuses from `cmd/lakectl/cmd/login.go:38-40`:
var loginRetryStatuses = slices.Concat(lakectlDefaultRetryStatuses,
	[]int{http.StatusNotFound},
)
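The retry decision described above can be sketched as a single classification function. This is a simplified illustration, not the code in `retry_client.go`: `shouldRetry`, `defaultRetryStatuses`, and `errWrappedCancel` are hypothetical names, the checks for invalid schemes and too many redirects are omitted, and treating every remaining transport error as retriable is an assumption.

```go
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"slices"
)

// Hypothetical variables; the status values come from the lists quoted on this page.
var (
	defaultRetryStatuses = []int{
		http.StatusTooManyRequests,     // 429
		http.StatusInternalServerError, // 500
		http.StatusServiceUnavailable,  // 503
	}
	// Login additionally retries on 404.
	loginRetryStatuses = slices.Concat(defaultRetryStatuses, []int{http.StatusNotFound})
)

// errWrappedCancel is an example of a cancellation wrapped by a caller.
var errWrappedCancel = fmt.Errorf("request aborted: %w", context.Canceled)

// shouldRetry reports whether a request outcome is worth another attempt.
// err is the transport error (nil if a response arrived); status is the
// HTTP status code of that response.
func shouldRetry(err error, status int, retryStatuses []int) bool {
	if err != nil {
		// The caller gave up or ran out of time: retrying cannot help.
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return false
		}
		// An untrusted certificate authority will not fix itself.
		var unknownCA x509.UnknownAuthorityError
		if errors.As(err, &unknownCA) {
			return false
		}
		// ASSUMPTION: any other transport error is treated as transient.
		return true
	}
	return slices.Contains(retryStatuses, status)
}

func main() {
	fmt.Println(shouldRetry(nil, http.StatusNotFound, defaultRetryStatuses))
	fmt.Println(shouldRetry(nil, http.StatusNotFound, loginRetryStatuses))
	fmt.Println(shouldRetry(errWrappedCancel, 0, defaultRetryStatuses))
}
```

Passing the status list as a parameter lets the same function serve both regular API calls and login, which differ only in whether 404 counts as retriable.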