
Heuristic:Langfuse BullMQ Retry Strategy Patterns

From Leeroopedia
Knowledge Sources
Domains Queue, Resilience
Last Updated 2026-02-14 06:00 GMT

Overview

A tiered BullMQ retry configuration in which critical ingestion queues get 6 attempts with 5-second exponential backoff, deletion queues get 2 attempts with 30-second backoff, and evaluation queues get 10 attempts with aggressive 1-second backoff.

Description

Langfuse operates 25+ BullMQ queues, each tuned for its specific workload characteristics. The retry configuration varies across three dimensions: attempt count (2-10), initial backoff delay (1-30 seconds), and failed job retention (100 to 100,000). These values reflect the team's experience with different failure modes: ingestion failures are transient (network/socket issues), deletion failures are resource-contention based (ClickHouse load), and evaluation failures are rate-limit driven (external LLM APIs).

Usage

Apply this heuristic when creating new BullMQ queues or tuning existing queue performance. Choose retry parameters based on the failure characteristics of the workload rather than applying a one-size-fits-all configuration.

The Insight (Rule of Thumb)

  • Ingestion Queues (IngestionQueue, OtelIngestionQueue): 6 attempts, 5s exponential backoff, 100k failed job retention.
    • Why: Transient network failures; data must not be lost.
  • Deletion Queues (TraceDelete, ScoreDelete, DatasetDelete): 2 attempts, 30s exponential backoff, 100k failed job retention.
    • Why: ClickHouse resource contention; long delays between retries allow load to subside.
  • Evaluation Queues (EvalExecution, LLMAsJudge): 10 attempts, 1s aggressive backoff, 10k failed job retention.
    • Why: External LLM API rate limits (429s) require many fast retries.
  • Project Deletion: 10 attempts, 5s backoff, 60s initial delay.
    • Why: Complex cascading operation requires more retries and settling time.
  • Trace Upsert: Configurable attempts (default: 2), 5s backoff, 30s initial delay.
    • Why: Delay allows related ingestion events to complete before triggering side effects.
  • Event Propagation: 3 attempts, global concurrency = 1.
    • Why: Must process sequentially for ordered partition processing.
  • Notifications: 5 attempts, 3s backoff (fastest in system).
    • Why: Real-time delivery matters; fast retries for transient failures.
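
The tiers above can be sketched as a small profile map. This is a minimal illustration, not Langfuse's actual code: the RetryProfile interface and RETRY_PROFILES name are invented here, and the field shape mirrors BullMQ's defaultJobOptions as quoted in the examples further down.

```typescript
// Illustrative sketch of the three main tiers as plain objects whose
// shape mirrors BullMQ's defaultJobOptions. RetryProfile and
// RETRY_PROFILES are names invented for this example.
interface RetryProfile {
  removeOnComplete: boolean;
  removeOnFail: number; // failed-job retention count
  attempts: number; // total tries, including the first
  backoff: { type: "exponential"; delay: number }; // initial delay in ms
}

const RETRY_PROFILES: Record<string, RetryProfile> = {
  // Transient network failures; data must not be lost
  ingestion: {
    removeOnComplete: true,
    removeOnFail: 100_000,
    attempts: 6,
    backoff: { type: "exponential", delay: 5_000 },
  },
  // ClickHouse resource contention; long waits let load subside
  deletion: {
    removeOnComplete: true,
    removeOnFail: 100_000,
    attempts: 2,
    backoff: { type: "exponential", delay: 30_000 },
  },
  // External LLM 429s; many fast retries
  evaluation: {
    removeOnComplete: true,
    removeOnFail: 10_000,
    attempts: 10,
    backoff: { type: "exponential", delay: 1_000 },
  },
};
```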

Reasoning

The key insight is that backoff delay and retry count should match the failure mode:

Failure Mode                          | Retry Count | Backoff Delay          | Example Queues
Transient network (socket hang-up)    | 5-6         | 5s exponential         | Ingestion, OTEL
Resource contention (ClickHouse busy) | 2           | 30s exponential        | TraceDelete, ScoreDelete
External rate limiting (LLM 429)      | 10          | 1s exponential         | EvalExecution, LLMAsJudge
Complex cascading operations          | 10          | 5s + 60s initial delay | ProjectDelete
Ordered sequential processing         | 3           | 5s, concurrency=1      | EventPropagation
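
BullMQ's built-in exponential strategy waits 2^(n-1) * delay milliseconds before the n-th retry, so each tier's total retry window can be made concrete with a small helper (invented here for illustration, not from the Langfuse codebase):

```typescript
// BullMQ's "exponential" backoff waits 2^(n-1) * delay ms before the
// n-th retry. This helper lists the waits implied by a retry profile.
function backoffSchedule(attempts: number, delayMs: number): number[] {
  const waits: number[] = [];
  // attempts includes the first try, so there are attempts - 1 retries
  for (let n = 1; n < attempts; n++) {
    waits.push(Math.pow(2, n - 1) * delayMs);
  }
  return waits;
}

// Ingestion: 6 attempts, 5s base -> retries after 5s, 10s, 20s, 40s, 80s
console.log(backoffSchedule(6, 5_000));
// Evaluation: 10 attempts, 1s base -> ninth retry waits 256s
console.log(backoffSchedule(10, 1_000));
```

Note how the aggressive 1-second base still grows into multi-minute waits by the final attempts, which is what rides out an extended rate-limit window.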

Additionally, removeOnFail (the number of failed jobs retained for inspection) varies by importance:

  • 100,000 for critical paths (ingestion, deletion) to enable post-mortem analysis
  • 10,000 for batch operations
  • 1,000 for notifications
  • 100 for scheduled/cron jobs

// Example: Ingestion queue (high reliability)
// From packages/shared/src/server/redis/ingestionQueue.ts
defaultJobOptions: {
  removeOnComplete: true,
  removeOnFail: 100_000,
  attempts: 6,
  backoff: { type: "exponential", delay: 5000 },
}

// Example: Eval execution (aggressive retry for rate limits)
// From packages/shared/src/server/redis/evalExecutionQueue.ts
defaultJobOptions: {
  removeOnComplete: true,
  removeOnFail: 10_000,
  attempts: 10,
  backoff: { type: "exponential", delay: 1000 },
}
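
For completeness, the deletion tier follows the same shape, and project deletion layers BullMQ's per-job delay option on top for the 60-second initial settling time. This sketch is illustrative; the variable names are invented here:

```typescript
// Sketch: deletion-style job options. The long 30s base backoff gives
// ClickHouse time to shed load between the two attempts.
const traceDeleteJobOptions = {
  removeOnComplete: true,
  removeOnFail: 100_000,
  attempts: 2,
  backoff: { type: "exponential" as const, delay: 30_000 },
};

// Project deletion is a complex cascading operation: more attempts,
// plus BullMQ's `delay` option to hold the first attempt for 60s
// so related state can settle.
const projectDeleteJobOptions = {
  ...traceDeleteJobOptions,
  attempts: 10,
  backoff: { type: "exponential" as const, delay: 5_000 },
  delay: 60_000,
};
```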
