Heuristic:Langfuse Langfuse BullMQ Retry Strategy Patterns
| Knowledge Sources | |
|---|---|
| Domains | Queue, Resilience |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
Tiered BullMQ retry configuration where critical ingestion queues get 6 attempts with 5-second exponential backoff, deletion queues get 2 attempts with 30-second backoff, and evaluation queues get 10 attempts with 1-second aggressive backoff.
Description
Langfuse operates 25+ BullMQ queues, each tuned for its specific workload characteristics. The retry configuration varies across three dimensions: attempt count (2-10), initial backoff delay (1-30 seconds), and failed job retention (100 to 100,000). These values reflect the team's experience with different failure modes: ingestion failures are transient (network/socket issues), deletion failures are resource-contention based (ClickHouse load), and evaluation failures are rate-limit driven (external LLM APIs).
Usage
Apply this heuristic when creating new BullMQ queues or tuning existing queue performance. Choose retry parameters based on the failure characteristics of the workload rather than applying a one-size-fits-all configuration.
The Insight (Rule of Thumb)
- Ingestion Queues (IngestionQueue, OtelIngestionQueue): 6 attempts, 5s exponential backoff, 100k failed job retention.
  - Why: Transient network failures; data must not be lost.
- Deletion Queues (TraceDelete, ScoreDelete, DatasetDelete): 2 attempts, 30s exponential backoff, 100k failed job retention.
  - Why: ClickHouse resource contention; long delays between retries allow load to subside.
- Evaluation Queues (EvalExecution, LLMAsJudge): 10 attempts, 1s aggressive backoff, 10k failed job retention.
  - Why: External LLM API rate limits (429s) require many fast retries.
- Project Deletion: 10 attempts, 5s backoff, 60s initial delay.
  - Why: Complex cascading operation requires more retries and settling time.
- Trace Upsert: Configurable attempts (default: 2), 5s backoff, 30s initial delay.
  - Why: Delay allows related ingestion events to complete before triggering side effects.
- Event Propagation: 3 attempts, global concurrency = 1.
  - Why: Must process sequentially for ordered partition processing.
- Notifications: 5 attempts, 3s backoff (among the fastest in the system).
  - Why: Real-time delivery matters; fast retries handle transient failures.
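The three main tiers above can be condensed into a small options table. Note that the `retryTiers` map and `tierFor` helper below are illustrative sketches of the pattern, not names from the Langfuse codebase:

```typescript
// Illustrative tier table; the values mirror the heuristics listed above,
// but the names here are hypothetical, not Langfuse's actual code.
type TierJobOptions = {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnFail: number;
};

const retryTiers: Record<string, TierJobOptions> = {
  // Transient network failures: retry persistently, keep everything for post-mortems
  ingestion: {
    attempts: 6,
    backoff: { type: "exponential", delay: 5_000 },
    removeOnFail: 100_000,
  },
  // ClickHouse contention: few retries, long pause so load can subside
  deletion: {
    attempts: 2,
    backoff: { type: "exponential", delay: 30_000 },
    removeOnFail: 100_000,
  },
  // External LLM 429s: many fast retries, smaller retention
  evaluation: {
    attempts: 10,
    backoff: { type: "exponential", delay: 1_000 },
    removeOnFail: 10_000,
  },
};

function tierFor(queue: "ingestion" | "deletion" | "evaluation"): TierJobOptions {
  return retryTiers[queue];
}
```

A new queue would pick its tier by failure mode, then spread the same options object into `defaultJobOptions` rather than hand-tuning each queue from scratch.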
Reasoning
The key insight is that backoff delay and retry count should match the failure mode:
| Failure Mode | Retry Count | Backoff Delay | Example Queues |
|---|---|---|---|
| Transient network (socket hang-up) | 5-6 | 5s exponential | Ingestion, OTEL |
| Resource contention (ClickHouse busy) | 2 | 30s exponential | TraceDelete, ScoreDelete |
| External rate limiting (LLM 429) | 10 | 1s exponential | EvalExecution, LLMAsJudge |
| Complex cascading operations | 10 | 5s + 60s initial delay | ProjectDelete |
| Ordered sequential processing | 3 | 5s, concurrency=1 | EventPropagation |
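BullMQ's built-in exponential strategy doubles the wait on each retry (delay * 2^(retry - 1) ms before the nth retry), so the base delay fully determines the retry schedule. A small sketch of the resulting schedules (the `backoffSchedule` helper is hypothetical):

```typescript
// BullMQ's built-in "exponential" backoff waits delay * 2^(retry - 1) ms
// before retry number `retry` (1-indexed). Helper below is illustrative.
function backoffSchedule(attempts: number, delayMs: number): number[] {
  // `attempts` includes the first try, so there are attempts - 1 retries.
  return Array.from({ length: attempts - 1 }, (_, i) => delayMs * 2 ** i);
}

// Ingestion: 6 attempts at 5s base → retries wait 5s, 10s, 20s, 40s, 80s.
console.log(backoffSchedule(6, 5_000));  // [5000, 10000, 20000, 40000, 80000]

// Eval: 10 attempts at 1s base → 1s, 2s, 4s, ... 256s.
// Fast at first (good for short-lived 429s), then spreads out.
console.log(backoffSchedule(10, 1_000));
```

This is why the eval tier can afford a 1s base despite 10 attempts: the doubling keeps the tail retries far apart, while the first few land quickly.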
Additionally, removeOnFail varies by importance:
- 100,000 for critical paths (ingestion, deletion) to enable post-mortem analysis
- 10,000 for batch operations
- 1,000 for notifications
- 100 for scheduled/cron jobs
```typescript
// Example: Ingestion queue (high reliability)
// From packages/shared/src/server/redis/ingestionQueue.ts
defaultJobOptions: {
  removeOnComplete: true,
  removeOnFail: 100_000,
  attempts: 6,
  backoff: { type: "exponential", delay: 5000 },
},
```

```typescript
// Example: Eval execution (aggressive retry for rate limits)
// From packages/shared/src/server/redis/evalExecutionQueue.ts
defaultJobOptions: {
  removeOnComplete: true,
  removeOnFail: 10_000,
  attempts: 10,
  backoff: { type: "exponential", delay: 1000 },
},
```
Related Pages
- Implementation:Langfuse_Langfuse_IngestionQueue
- Implementation:Langfuse_Langfuse_TraceUpsertQueue
- Implementation:Langfuse_Langfuse_EvalJobExecutorQueueProcessor
- Implementation:Langfuse_Langfuse_DatasetRunItemUpsertQueue
- Principle:Langfuse_Langfuse_Ingestion_Queue_Dispatch
- Principle:Langfuse_Langfuse_Eval_Error_Handling_and_Retry