Principle: Langfuse Eval Error Handling and Retry
| Knowledge Sources | Details |
|---|---|
| Domains | Error Handling, Job Queue Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Eval Error Handling and Retry is the principle of classifying evaluation execution errors into retryable and non-retryable categories, applying appropriate retry strategies with time-bounded limits, and maintaining accurate job execution status throughout the error lifecycle.
Description
Evaluation execution involves calling external LLM APIs, which introduces multiple failure modes: rate limiting (HTTP 429), server errors (HTTP 5xx), client errors (HTTP 4xx), network timeouts, invalid model configurations, and application-level bugs. Each failure mode requires a different response strategy to balance reliability against wasted resources.
The Eval Error Handling and Retry principle categorizes errors into three tiers:
- Retryable LLM Errors (429/5xx) -- Rate limit and server errors from LLM providers are transient and likely to succeed on retry. These are handled with a custom delayed retry mechanism: the job execution is set to DELAYED status, and a new queue job is enqueued with an incrementing delay (1-25 minutes). A 24-hour time limit prevents indefinite retries for persistent outages. If the job is older than 24 hours, the execution transitions to ERROR status instead.
- Non-Retryable LLM Errors (4xx) -- Client errors such as invalid API keys, model not found, or content policy violations will not succeed on retry. These immediately set the execution to ERROR status with the LLM error message preserved for user debugging. The UnrecoverableError class is also used for application-level configuration errors (invalid model config, missing template, invalid output schema).
- Unexpected Application Errors -- Errors that do not fall into the above categories (e.g., database failures, unexpected exceptions) are logged, the execution is set to ERROR with a generic "An internal error occurred" message, and the error is re-thrown to trigger BullMQ's built-in retry mechanism with exponential backoff. This allows transient infrastructure issues to be retried automatically.
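The three tiers above can be sketched as a small classifier. The class names `LLMCompletionError` (with its `isRetryable` flag) and `UnrecoverableError` come from the text; the `classifyEvalError` helper and the `Tier` union are illustrative names, not the actual implementation.

```typescript
// Sketch of the three-tier error classification, assuming the error classes
// described in the text. Names other than the two error classes are invented.

class LLMCompletionError extends Error {
  constructor(message: string, public readonly isRetryable: boolean) {
    super(message);
  }
}

class UnrecoverableError extends Error {}

type Tier = "RETRYABLE_LLM" | "NON_RETRYABLE" | "UNEXPECTED";

function classifyEvalError(err: unknown): Tier {
  // Tier 1: transient LLM failures (429 / 5xx) flagged retryable by the client
  if (err instanceof LLMCompletionError && err.isRetryable) {
    return "RETRYABLE_LLM";
  }
  // Tier 2: 4xx client errors and application-level config/validation errors
  if (err instanceof LLMCompletionError || err instanceof UnrecoverableError) {
    return "NON_RETRYABLE";
  }
  // Tier 3: everything else falls through to BullMQ's built-in retry
  return "UNEXPECTED";
}
```

Keeping the classification in one pure function makes each tier's response strategy (delayed retry, immediate ERROR, re-throw) easy to test in isolation.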
The error handling also maintains full observability: every error path updates the job execution record with an executionTraceId that links to the evaluation's internal Langfuse trace, enabling developers to inspect the exact LLM call that failed.
Usage
Use Eval Error Handling and Retry when:
- You need to understand why an evaluation execution failed and whether it will be retried
- You are debugging evaluation failures visible in the Langfuse UI as ERROR or DELAYED status
- You are tuning retry behavior for LLM rate limits in high-volume evaluation scenarios
- You need to understand the relationship between BullMQ retries and the custom delayed retry mechanism
Theoretical Basis
The Eval Error Handling and Retry principle implements a tiered error classification with dual retry mechanisms:
Error Classification Decision Tree:
JOB FAILS WITH ERROR
|
v
IS IT LLMCompletionError WITH isRetryable=true (429/5xx)?
|
+-- YES --> IS JOB < 24 HOURS OLD?
| |
| +-- YES --> SET status = DELAYED
| | ENQUEUE retry with delay (1-25 min)
| | RETURN (do not throw)
| |
| +-- NO --> SET status = ERROR
| message = LLM error message
| RETURN (do not throw)
|
+-- NO --> IS IT LLMCompletionError (non-retryable 4xx)?
|
+-- YES --> SET status = ERROR
| message = LLM error message
| RETURN (do not throw)
|
+-- NO --> IS IT UnrecoverableError (config/validation)?
|
+-- YES --> SET status = ERROR
| message = error message
| RETURN (do not throw)
|
+-- NO --> SET status = ERROR
message = "An internal error occurred"
LOG error + send to exception tracker
THROW (triggers BullMQ retry)
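The decision tree above maps onto a single catch handler. The sketch below is self-contained and simplified: the error class names match the text, while `Execution`, `handleEvalError`, and the injected `enqueueRetry` callback are stand-ins for the real persistence and queue layers.

```typescript
// Minimal sketch of the catch block the decision tree describes. Only the
// error class names come from the text; everything else is illustrative.

class LLMCompletionError extends Error {
  constructor(message: string, public readonly isRetryable: boolean) {
    super(message);
  }
}
class UnrecoverableError extends Error {}

interface Execution { status?: "DELAYED" | "ERROR"; message?: string }

const DAY_MS = 24 * 60 * 60 * 1000;

function handleEvalError(
  err: unknown,
  execution: Execution,
  originalJobTimestamp: number,
  enqueueRetry: () => void,
  now: number = Date.now(),
): void {
  if (err instanceof LLMCompletionError && err.isRetryable) {
    if (now - originalJobTimestamp < DAY_MS) {
      execution.status = "DELAYED"; // retry later; do not throw
      enqueueRetry();
      return;
    }
    execution.status = "ERROR"; // older than 24h: give up, keep LLM message
    execution.message = err.message;
    return;
  }
  if (err instanceof LLMCompletionError || err instanceof UnrecoverableError) {
    execution.status = "ERROR"; // non-retryable: surface the message
    execution.message = err.message;
    return;
  }
  execution.status = "ERROR"; // unexpected: generic message, then re-throw
  execution.message = "An internal error occurred";
  throw err; // triggers BullMQ's built-in retry
}
```

Note that only the last branch throws: the first three paths intentionally return so BullMQ marks the job as completed and does not apply its own retry on top of the custom one.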
Dual Retry Mechanisms:
The system employs two distinct retry mechanisms for different failure modes:
| Mechanism | Trigger | Delay Strategy | Max Duration | Status During Retry |
|---|---|---|---|---|
| Custom Delayed Retry | LLM 429/5xx errors | Incrementing delay: 1-25 minutes per attempt | 24 hours from original job creation | DELAYED |
| BullMQ Built-in Retry | Unexpected application errors | Exponential backoff: 5s base, 5 attempts | ~80 seconds total | PENDING (BullMQ manages) |
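The "~80 seconds total" entry in the table can be sanity-checked from BullMQ's exponential backoff formula, `base * 2^(retry - 1)`, assuming a 5 s base and 5 attempts total (i.e. 4 re-runs after the initial failure):

```typescript
// Sanity check of the table's "~80 seconds total" figure. The 5s base and
// 5 attempts come from the table; the formula is BullMQ's standard
// exponential backoff.

function bullmqBackoffDelays(baseMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseMs * 2 ** (retry - 1)); // 5s, 10s, 20s, 40s
  }
  return delays;
}

const totalMs = bullmqBackoffDelays(5_000, 5).reduce((a, b) => a + b, 0);
// 5 + 10 + 20 + 40 = 75 seconds of cumulative delay, i.e. "~80 seconds"
```

This short window is deliberate: BullMQ's retry is only meant to paper over brief infrastructure blips, while the custom mechanism handles the much longer LLM-provider outage horizon.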
Custom Delayed Retry Flow:
FUNCTION retryLLMRateLimitError(job, config):
originalTimestamp = job.data.retryBaggage.originalJobTimestamp
currentAttempt = job.data.retryBaggage.attempt
IF (now - originalTimestamp) > 24 HOURS:
// Stop retrying -- too old
UPDATE execution SET status = ERROR
RETURN
delay = delayInMs(currentAttempt) // 1-25 minutes, incrementing
newAttempt = currentAttempt + 1
ENQUEUE new job to EvalExecutionQueue {
...originalPayload,
retryBaggage: {
originalJobTimestamp: originalTimestamp,
attempt: newAttempt
},
delay: delay
}
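The `delayInMs` helper in the flow above could take several shapes; the text only fixes the 1-25 minute range and that it increments with the attempt counter. One plausible sketch, assuming a linear ramp capped at 25 minutes:

```typescript
// Hypothetical shape for delayInMs: a linear ramp from 1 minute, growing
// with the attempt counter, capped at 25 minutes. The exact curve is an
// assumption; only the 1-25 minute bounds come from the text.

const MINUTE_MS = 60_000;

function delayInMs(attempt: number): number {
  const minutes = Math.min(1 + attempt * 2, 25); // 1, 3, 5, ... capped at 25
  return minutes * MINUTE_MS;
}
```

Whatever the curve, the cap matters: it bounds queue churn during a long provider outage while the 24-hour window, not the per-attempt delay, decides when to stop retrying.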
Job Execution Status Lifecycle:
PENDING --> (execution starts)
|
+-- SUCCESS --> COMPLETED (with jobOutputScoreId)
|
+-- RETRYABLE ERROR --> DELAYED --> (retry) --> PENDING ...
|
+-- NON-RETRYABLE ERROR --> ERROR (with error message)
|
+-- CANCELLED (by trace deselection) --> CANCELLED
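The lifecycle above can be expressed as a transition table. The status names come from the diagram; the `canTransition` guard is an illustrative helper, not the actual implementation.

```typescript
// Sketch of the job execution status lifecycle as a transition table.
// Status names match the diagram; the guard function is invented.

type JobStatus = "PENDING" | "COMPLETED" | "DELAYED" | "ERROR" | "CANCELLED";

const TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  PENDING: ["COMPLETED", "DELAYED", "ERROR", "CANCELLED"],
  DELAYED: ["PENDING", "ERROR"], // retry re-enters PENDING; 24h cutoff -> ERROR
  COMPLETED: [], // terminal
  ERROR: [],     // terminal
  CANCELLED: [], // terminal
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Encoding the lifecycle this way makes illegal updates (e.g. resurrecting a COMPLETED execution) detectable at the persistence boundary.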
Observability Integration:
Every error path records an executionTraceId derived deterministically from the job execution ID using the W3C trace ID format. This allows the evaluation's internal Langfuse trace to be linked from the job execution record, providing a single-click path from a failed evaluation to the exact LLM request and response.
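A deterministic W3C-format trace ID is 32 lowercase hex characters. One way to derive it from an execution ID is to hash and truncate; hashing with SHA-256 is an assumption about the derivation, since the text only states that it is deterministic and W3C-formatted.

```typescript
// Sketch of deriving a deterministic, W3C-format trace ID (32 lowercase hex
// chars) from a job execution ID. The SHA-256 + truncate approach is an
// assumption, not the confirmed implementation.

import { createHash } from "node:crypto";

function executionTraceId(jobExecutionId: string): string {
  return createHash("sha256").update(jobExecutionId).digest("hex").slice(0, 32);
}
```

Determinism is the key property: any later retry or status update can recompute the same trace ID from the execution ID alone, without storing an extra mapping.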
Idempotency Considerations:
The custom delayed retry creates a new BullMQ job rather than reusing the failed one. This means the original job completes (without error) while a new job is scheduled. The retry baggage (originalJobTimestamp, attempt counter) is propagated through the new job's data, maintaining continuity across retries. The job execution record in PostgreSQL is updated to DELAYED status to reflect the pending retry.