Heuristic:Langfuse Langfuse LLM Rate Limit 24h Abandon
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Resilience |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
Evaluation and experiment LLM jobs that are still failing with retryable errors (HTTP 429 or 5xx) more than 24 hours after they were created are abandoned rather than retried indefinitely, so that permanently failing jobs stop consuming worker and queue resources.
Description
When LLM API providers return rate-limit errors (HTTP 429) or server errors (HTTP 5xx), Langfuse retries the job with exponential backoff. However, once the job is more than 24 hours old (measured from the job's created_at timestamp in the database, not from the first rate-limit error), the system stops retrying. This keeps the queue backlog from growing indefinitely when an LLM provider has a prolonged outage or the user's API key has been permanently rate-limited.
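To make the backoff behaviour concrete, here is a minimal sketch of a capped exponential delay. The function name, the base delay, and the cap are all hypothetical; this page does not state the values Langfuse actually uses.

```typescript
// Hypothetical parameters for illustration only; not Langfuse's real settings.
const BASE_DELAY_MS = 1_000; // assumed delay before the first retry
const MAX_DELAY_MS = 10 * 60 * 1000; // assumed upper bound on any single delay

// Returns the delay before retry number `attempt` (0-based):
// base * 2^attempt, clamped to the cap.
function backoffDelayMs(
  attempt: number,
  baseMs: number = BASE_DELAY_MS,
  capMs: number = MAX_DELAY_MS,
): number {
  const safeAttempt = Math.max(0, Math.floor(attempt));
  return Math.min(capMs, baseMs * 2 ** safeAttempt);
}
```

With any capped backoff of this shape, a job that never succeeds would keep being re-enqueued at the cap interval forever. The 24-hour age check is what ends that loop.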
Usage
This heuristic is automatically applied in the worker's retry handler for evaluation and experiment LLM completion jobs. It affects any job that calls external LLM APIs (OpenAI, Anthropic, Azure, Bedrock, Google, etc.).
The Insight (Rule of Thumb)
- Action: Check the job's age before retrying rate-limited LLM calls; abandon if older than 24 hours.
- Value: Prevents infinite retry loops and unbounded queue growth.
- Trade-off: Jobs that genuinely need more than 24 hours of retries are dropped with only an info-level log entry. Users must re-trigger evaluations manually for these cases.
- Applies To: Evaluation jobs (eval execution queue) and experiment jobs (experiment create queue).
Reasoning
The 24-hour threshold balances two concerns:
- Transient outages (minutes to hours): Should be retried. Common with LLM API rate limits.
- Permanent failures (days): Should not be retried. Examples: revoked API key, provider policy change, quota exhaustion.
The implementation fetches the job's created_at timestamp from the database (not the BullMQ job timestamp) to ensure accuracy even if the job has been rescheduled multiple times:
```typescript
// From worker/src/features/utils/retry-handler.ts
const ONE_DAY_IN_MS = 24 * 60 * 60 * 1000;

const record = await kyselyPrisma.$kysely
  .selectFrom(config.table)
  .select("created_at")
  .where("id", "=", jobId)
  .executeTakeFirstOrThrow();

if (record.created_at < new Date(Date.now() - ONE_DAY_IN_MS)) {
  logger.info(`Job ${jobId} is rate limited for more than 24h. Stop retrying.`);
  return; // Don't retry - abandon the job
}
```
Error classification determines retryability:
- HTTP 429 (rate limit): Retryable with custom delay
- HTTP 5xx (server error): Retryable with custom delay
- HTTP 4xx (client error, except 429): Not retryable (permanent failure)
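The classification above and the 24-hour cutoff can be combined into a single decision. The following is a minimal, self-contained sketch of that logic, not the code in retry-handler.ts: the names `RetryDecision`, `isRetryableStatus`, and `decideRetry` are hypothetical, and treating every status outside 429/5xx as non-retryable is an assumption.

```typescript
type RetryDecision = "retry" | "abandon" | "fail";

const ONE_DAY_IN_MS = 24 * 60 * 60 * 1000;

// True for the status codes the list above marks as retryable.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Decides what to do with a failed LLM call.
// - "fail":    the error is not retryable (assumed for anything other than 429/5xx)
// - "abandon": the error is retryable, but the job is more than 24 hours old
// - "retry":   the error is retryable and the job is still inside the window
function decideRetry(
  status: number,
  createdAt: Date,
  now: Date = new Date(),
): RetryDecision {
  if (!isRetryableStatus(status)) {
    return "fail";
  }
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > ONE_DAY_IN_MS ? "abandon" : "retry";
}
```

Taking `now` as a parameter keeps the function pure, so the cutoff can be unit-tested without mocking the clock. The age comparison mirrors the excerpt's `created_at < now - ONE_DAY_IN_MS` check, so a job that is exactly 24 hours old is still retried.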