Principle: Langfuse Eval Error Handling and Retry
| Knowledge Sources | Details |
|---|---|
| Domains | Error Handling, Job Queue Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Eval Error Handling and Retry is the principle of classifying evaluation execution errors into retryable and non-retryable categories, applying appropriate retry strategies with time-bounded limits, and maintaining accurate job execution status throughout the error lifecycle.
Description
Evaluation execution involves calling external LLM APIs, which introduces multiple failure modes: rate limiting (HTTP 429), server errors (HTTP 5xx), client errors (HTTP 4xx), network timeouts, invalid model configurations, and application-level bugs. Each failure mode requires a different response strategy to balance reliability against wasted resources.
The Eval Error Handling and Retry principle categorizes errors into three tiers:
- Retryable LLM Errors (429/5xx) -- Rate limit and server errors from LLM providers are transient and likely to succeed on retry. These are handled with a custom delayed retry mechanism: the job execution is set to DELAYED status, and a new queue job is enqueued with an incrementing delay (1-25 minutes). A 24-hour time limit prevents indefinite retries for persistent outages. If the job is older than 24 hours, the execution transitions to ERROR status instead.
- Non-Retryable LLM Errors (4xx) -- Client errors such as invalid API keys, model not found, or content policy violations will not succeed on retry. These immediately set the execution to ERROR status with the LLM error message preserved for user debugging. The UnrecoverableError class is also used for application-level configuration errors (invalid model config, missing template, invalid output schema).
- Unexpected Application Errors -- Errors that do not fall into the above categories (e.g., database failures, unexpected exceptions) are logged, the execution is set to ERROR with a generic "An internal error occurred" message, and the error is re-thrown to trigger BullMQ's built-in retry mechanism with exponential backoff. This allows transient infrastructure issues to be retried automatically.
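The three tiers above can be sketched as a small classifier. The class names `LLMCompletionError` (with its `isRetryable` flag) and `UnrecoverableError` come from the text; the `classifyEvalError` helper and the `Tier` union are illustrative names, not the actual implementation.

```typescript
// Sketch of the three-tier error classification, assuming the error classes
// described in the text. Names other than the two error classes are invented.

class LLMCompletionError extends Error {
  constructor(message: string, public readonly isRetryable: boolean) {
    super(message);
  }
}

class UnrecoverableError extends Error {}

type Tier = "RETRYABLE_LLM" | "NON_RETRYABLE" | "UNEXPECTED";

function classifyEvalError(err: unknown): Tier {
  // Tier 1: transient LLM failures (429 / 5xx) flagged retryable by the client
  if (err instanceof LLMCompletionError && err.isRetryable) {
    return "RETRYABLE_LLM";
  }
  // Tier 2: 4xx client errors and application-level config/validation errors
  if (err instanceof LLMCompletionError || err instanceof UnrecoverableError) {
    return "NON_RETRYABLE";
  }
  // Tier 3: everything else falls through to BullMQ's built-in retry
  return "UNEXPECTED";
}
```

Keeping the classification in one pure function makes each tier's response strategy (delayed retry, immediate ERROR, re-throw) easy to test in isolation.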
The error handling also maintains full observability: every error path updates the job execution record with an executionTraceId that links to the evaluation's internal Langfuse trace, enabling developers to inspect the exact LLM call that failed.
Usage
Use Eval Error Handling and Retry when:
- You need to understand why an evaluation execution failed and whether it will be retried
- You are debugging evaluation failures visible in the Langfuse UI as ERROR or DELAYED status
- You are tuning retry behavior for LLM rate limits in high-volume evaluation scenarios
- You need to understand the relationship between BullMQ retries and the custom delayed retry mechanism
Theoretical Basis
The Eval Error Handling and Retry principle implements a tiered error classification with dual retry mechanisms:
Error Classification Decision Tree:
JOB FAILS WITH ERROR
|
v
IS IT LLMCompletionError WITH isRetryable=true (429/5xx)?
|
+-- YES --> IS JOB < 24 HOURS OLD?
| |
| +-- YES --> SET status = DELAYED
| | ENQUEUE retry with delay (1-25 min)
| | RETURN (do not throw)
| |
| +-- NO --> SET status = ERROR
| message = LLM error message
| RETURN (do not throw)
|
+-- NO --> IS IT LLMCompletionError (non-retryable 4xx)?
|
+-- YES --> SET status = ERROR
| message = LLM error message
| RETURN (do not throw)
|
+-- NO --> IS IT UnrecoverableError (config/validation)?
|
+-- YES --> SET status = ERROR
| message = error message
| RETURN (do not throw)
|
+-- NO --> SET status = ERROR
message = "An internal error occurred"
LOG error + send to exception tracker
THROW (triggers BullMQ retry)
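The decision tree above maps onto a single catch handler. The sketch below is self-contained and simplified: the error class names match the text, while `Execution`, `handleEvalError`, and the injected `enqueueRetry` callback are stand-ins for the real persistence and queue layers.

```typescript
// Minimal sketch of the catch block the decision tree describes. Only the
// error class names come from the text; everything else is illustrative.

class LLMCompletionError extends Error {
  constructor(message: string, public readonly isRetryable: boolean) {
    super(message);
  }
}
class UnrecoverableError extends Error {}

interface Execution { status?: "DELAYED" | "ERROR"; message?: string }

const DAY_MS = 24 * 60 * 60 * 1000;

function handleEvalError(
  err: unknown,
  execution: Execution,
  originalJobTimestamp: number,
  enqueueRetry: () => void,
  now: number = Date.now(),
): void {
  if (err instanceof LLMCompletionError && err.isRetryable) {
    if (now - originalJobTimestamp < DAY_MS) {
      execution.status = "DELAYED"; // retry later; do not throw
      enqueueRetry();
      return;
    }
    execution.status = "ERROR"; // older than 24h: give up, keep LLM message
    execution.message = err.message;
    return;
  }
  if (err instanceof LLMCompletionError || err instanceof UnrecoverableError) {
    execution.status = "ERROR"; // non-retryable: surface the message
    execution.message = err.message;
    return;
  }
  execution.status = "ERROR"; // unexpected: generic message, then re-throw
  execution.message = "An internal error occurred";
  throw err; // triggers BullMQ's built-in retry
}
```

Note that only the last branch throws: the first three paths intentionally return so BullMQ marks the job as completed and does not apply its own retry on top of the custom one.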
Dual Retry Mechanisms:
The system employs two distinct retry mechanisms for different failure modes:
| Mechanism | Trigger | Delay Strategy | Max Duration | Status During Retry |
|---|---|---|---|---|
| Custom Delayed Retry | LLM 429/5xx errors | Incrementing delay: 1-25 minutes per attempt | 24 hours from original job creation | DELAYED |
| BullMQ Built-in Retry | Unexpected application errors | Exponential backoff: 5s base, 5 attempts | ~80 seconds total | PENDING (BullMQ manages) |
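The "~80 seconds total" entry in the table can be sanity-checked from BullMQ's exponential backoff formula, `base * 2^(retry - 1)`, assuming a 5 s base and 5 attempts total (i.e. 4 re-runs after the initial failure):

```typescript
// Sanity check of the table's "~80 seconds total" figure. The 5s base and
// 5 attempts come from the table; the formula is BullMQ's standard
// exponential backoff.

function bullmqBackoffDelays(baseMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseMs * 2 ** (retry - 1)); // 5s, 10s, 20s, 40s
  }
  return delays;
}

const totalMs = bullmqBackoffDelays(5_000, 5).reduce((a, b) => a + b, 0);
// 5 + 10 + 20 + 40 = 75 seconds of cumulative delay, i.e. "~80 seconds"
```

This short window is deliberate: BullMQ's retry is only meant to paper over brief infrastructure blips, while the custom mechanism handles the much longer LLM-provider outage horizon.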
Custom Delayed Retry Flow:
FUNCTION retryLLMRateLimitError(job, config):
originalTimestamp = job.data.retryBaggage.originalJobTimestamp
currentAttempt = job.data.retryBaggage.attempt
IF (now - originalTimestamp) > 24 HOURS:
// Stop retrying -- too old
UPDATE execution SET status = ERROR
RETURN
delay = delayInMs(currentAttempt) // 1-25 minutes, incrementing
newAttempt = currentAttempt + 1
ENQUEUE new job to EvalExecutionQueue {
...originalPayload,
retryBaggage: {
originalJobTimestamp: originalTimestamp,
attempt: newAttempt
},
delay: delay
}
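The `delayInMs` helper in the flow above could take several shapes; the text only fixes the 1-25 minute range and that it increments with the attempt counter. One plausible sketch, assuming a linear ramp capped at 25 minutes:

```typescript
// Hypothetical shape for delayInMs: a linear ramp from 1 minute, growing
// with the attempt counter, capped at 25 minutes. The exact curve is an
// assumption; only the 1-25 minute bounds come from the text.

const MINUTE_MS = 60_000;

function delayInMs(attempt: number): number {
  const minutes = Math.min(1 + attempt * 2, 25); // 1, 3, 5, ... capped at 25
  return minutes * MINUTE_MS;
}
```

Whatever the curve, the cap matters: it bounds queue churn during a long provider outage while the 24-hour window, not the per-attempt delay, decides when to stop retrying.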
Job Execution Status Lifecycle:
PENDING --> (execution starts)
|
+-- SUCCESS --> COMPLETED (with jobOutputScoreId)
|
+-- RETRYABLE ERROR --> DELAYED --> (retry) --> PENDING ...
|
+-- NON-RETRYABLE ERROR --> ERROR (with error message)
|
+-- CANCELLED (by trace deselection) --> CANCELLED
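The lifecycle above can be expressed as a transition table. The status names come from the diagram; the `canTransition` guard is an illustrative helper, not the actual implementation.

```typescript
// Sketch of the job execution status lifecycle as a transition table.
// Status names match the diagram; the guard function is invented.

type JobStatus = "PENDING" | "COMPLETED" | "DELAYED" | "ERROR" | "CANCELLED";

const TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  PENDING: ["COMPLETED", "DELAYED", "ERROR", "CANCELLED"],
  DELAYED: ["PENDING", "ERROR"], // retry re-enters PENDING; 24h cutoff -> ERROR
  COMPLETED: [], // terminal
  ERROR: [],     // terminal
  CANCELLED: [], // terminal
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Encoding the lifecycle this way makes illegal updates (e.g. resurrecting a COMPLETED execution) detectable at the persistence boundary.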
Observability Integration:
Every error path records an executionTraceId derived deterministically from the job execution ID using the W3C trace ID format. This allows the evaluation's internal Langfuse trace to be linked from the job execution record, providing a single-click path from a failed evaluation to the exact LLM request and response.
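A deterministic W3C-format trace ID is 32 lowercase hex characters. One way to derive it from an execution ID is to hash and truncate; hashing with SHA-256 is an assumption about the derivation, since the text only states that it is deterministic and W3C-formatted.

```typescript
// Sketch of deriving a deterministic, W3C-format trace ID (32 lowercase hex
// chars) from a job execution ID. The SHA-256 + truncate approach is an
// assumption, not the confirmed implementation.

import { createHash } from "node:crypto";

function executionTraceId(jobExecutionId: string): string {
  return createHash("sha256").update(jobExecutionId).digest("hex").slice(0, 32);
}
```

Determinism is the key property: any later retry or status update can recompute the same trace ID from the execution ID alone, without storing an extra mapping.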
Idempotency Considerations:
The custom delayed retry creates a new BullMQ job rather than reusing the failed one. This means the original job completes (without error) while a new job is scheduled. The retry baggage (originalJobTimestamp, attempt counter) is propagated through the new job's data, maintaining continuity across retries. The job execution record in PostgreSQL is updated to DELAYED status to reflect the pending retry.