
Heuristic:Langfuse BullMQ Retry Strategy Patterns

From Leeroopedia
Knowledge Sources
Domains Queue, Resilience
Last Updated 2026-02-14 06:00 GMT

Overview

A tiered BullMQ retry configuration in which critical ingestion queues get 6 attempts with 5-second exponential backoff, deletion queues get 2 attempts with 30-second backoff, and evaluation queues get 10 attempts with aggressive 1-second backoff.

Description

Langfuse operates 25+ BullMQ queues, each tuned for its specific workload characteristics. The retry configuration varies across three dimensions: attempt count (2-10), initial backoff delay (1-30 seconds), and failed job retention (100 to 100,000). These values reflect the team's experience with different failure modes: ingestion failures are transient (network/socket issues), deletion failures are resource-contention based (ClickHouse load), and evaluation failures are rate-limit driven (external LLM APIs).

Usage

Apply this heuristic when creating new BullMQ queues or tuning existing queue performance. Choose retry parameters based on the failure characteristics of the workload rather than applying a one-size-fits-all configuration.

The Insight (Rule of Thumb)

  • Ingestion Queues (IngestionQueue, OtelIngestionQueue): 6 attempts, 5s exponential backoff, 100k failed job retention.
    • Why: Transient network failures; data must not be lost.
  • Deletion Queues (TraceDelete, ScoreDelete, DatasetDelete): 2 attempts, 30s exponential backoff, 100k failed job retention.
    • Why: ClickHouse resource contention; long delays between retries allow load to subside.
  • Evaluation Queues (EvalExecution, LLMAsJudge): 10 attempts, 1s aggressive backoff, 10k failed job retention.
    • Why: External LLM API rate limits (429s) require many fast retries.
  • Project Deletion: 10 attempts, 5s backoff, 60s initial delay.
    • Why: Complex cascading operation requires more retries and settling time.
  • Trace Upsert: Configurable attempts (default: 2), 5s backoff, 30s initial delay.
    • Why: Delay allows related ingestion events to complete before triggering side effects.
  • Event Propagation: 3 attempts, global concurrency = 1.
    • Why: Must process sequentially for ordered partition processing.
  • Notifications: 5 attempts, 3s backoff (fastest in system).
    • Why: Real-time delivery matters; fast retries for transient failures.
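
The tiers above can be sketched as a small profile map. This is a minimal illustration, not Langfuse's actual code: the RetryProfile interface and RETRY_PROFILES name are invented here, and the field shape mirrors BullMQ's defaultJobOptions as quoted in the examples further down.

```typescript
// Illustrative sketch of the three main tiers as plain objects whose
// shape mirrors BullMQ's defaultJobOptions. RetryProfile and
// RETRY_PROFILES are names invented for this example.
interface RetryProfile {
  removeOnComplete: boolean;
  removeOnFail: number; // failed-job retention count
  attempts: number; // total tries, including the first
  backoff: { type: "exponential"; delay: number }; // initial delay in ms
}

const RETRY_PROFILES: Record<string, RetryProfile> = {
  // Transient network failures; data must not be lost
  ingestion: {
    removeOnComplete: true,
    removeOnFail: 100_000,
    attempts: 6,
    backoff: { type: "exponential", delay: 5_000 },
  },
  // ClickHouse resource contention; long waits let load subside
  deletion: {
    removeOnComplete: true,
    removeOnFail: 100_000,
    attempts: 2,
    backoff: { type: "exponential", delay: 30_000 },
  },
  // External LLM 429s; many fast retries
  evaluation: {
    removeOnComplete: true,
    removeOnFail: 10_000,
    attempts: 10,
    backoff: { type: "exponential", delay: 1_000 },
  },
};
```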

Reasoning

The key insight is that backoff delay and retry count should match the failure mode:

Failure Mode                          | Retry Count | Backoff Delay          | Example Queues
Transient network (socket hang-up)    | 5-6         | 5s exponential         | Ingestion, OTEL
Resource contention (ClickHouse busy) | 2           | 30s exponential        | TraceDelete, ScoreDelete
External rate limiting (LLM 429)      | 10          | 1s exponential         | EvalExecution, LLMAsJudge
Complex cascading operations          | 10          | 5s + 60s initial delay | ProjectDelete
Ordered sequential processing         | 3           | 5s, concurrency=1      | EventPropagation
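
BullMQ's built-in exponential strategy waits 2^(n-1) * delay milliseconds before the n-th retry, so each tier's total retry window can be made concrete with a small helper (invented here for illustration, not from the Langfuse codebase):

```typescript
// BullMQ's "exponential" backoff waits 2^(n-1) * delay ms before the
// n-th retry. This helper lists the waits implied by a retry profile.
function backoffSchedule(attempts: number, delayMs: number): number[] {
  const waits: number[] = [];
  // attempts includes the first try, so there are attempts - 1 retries
  for (let n = 1; n < attempts; n++) {
    waits.push(Math.pow(2, n - 1) * delayMs);
  }
  return waits;
}

// Ingestion: 6 attempts, 5s base -> retries after 5s, 10s, 20s, 40s, 80s
console.log(backoffSchedule(6, 5_000));
// Evaluation: 10 attempts, 1s base -> ninth retry waits 256s
console.log(backoffSchedule(10, 1_000));
```

Note how the aggressive 1-second base still grows into multi-minute waits by the final attempts, which is what rides out an extended rate-limit window.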

Additionally, removeOnFail (the number of failed jobs retained for inspection) varies by importance:

  • 100,000 for critical paths (ingestion, deletion) to enable post-mortem analysis
  • 10,000 for batch operations
  • 1,000 for notifications
  • 100 for scheduled/cron jobs

// Example: Ingestion queue (high reliability)
// From packages/shared/src/server/redis/ingestionQueue.ts
defaultJobOptions: {
  removeOnComplete: true,
  removeOnFail: 100_000,
  attempts: 6,
  backoff: { type: "exponential", delay: 5000 },
}

// Example: Eval execution (aggressive retry for rate limits)
// From packages/shared/src/server/redis/evalExecutionQueue.ts
defaultJobOptions: {
  removeOnComplete: true,
  removeOnFail: 10_000,
  attempts: 10,
  backoff: { type: "exponential", delay: 1000 },
}
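
For completeness, the deletion tier follows the same shape, and project deletion layers BullMQ's per-job delay option on top for the 60-second initial settling time. This sketch is illustrative; the variable names are invented here:

```typescript
// Sketch: deletion-style job options. The long 30s base backoff gives
// ClickHouse time to shed load between the two attempts.
const traceDeleteJobOptions = {
  removeOnComplete: true,
  removeOnFail: 100_000,
  attempts: 2,
  backoff: { type: "exponential" as const, delay: 30_000 },
};

// Project deletion is a complex cascading operation: more attempts,
// plus BullMQ's `delay` option to hold the first attempt for 60s
// so related state can settle.
const projectDeleteJobOptions = {
  ...traceDeleteJobOptions,
  attempts: 10,
  backoff: { type: "exponential" as const, delay: 5_000 },
  delay: 60_000,
};
```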
