Heuristic:Ucbepic Docetl Validation Retry Strategy
| Knowledge Sources | |
|---|---|
| Domains | LLM_Pipelines, Debugging |
| Last Updated | 2026-02-08 01:00 GMT |
Overview
By default, DocETL retries an LLM call up to 2 times (3 total attempts) when the output fails schema validation. Each LLM call has a configurable timeout of 120 seconds.
Description
LLM outputs are not guaranteed to match the expected schema. When an output fails validation (e.g., wrong type, missing field, value not in allowed set), DocETL retries the LLM call up to 2 additional times. Validation retries are separate from rate-limit retries and timeout retries. The relevant settings are:
- Validation retries: 2 retries (configurable via `num_retries_on_validate_failure`)
- Timeout: 120 seconds per LLM call (configurable via `timeout`)
- Timeout retries: 2 retries per timeout (configurable via `max_retries_per_timeout`)
Together, these settings give an upper bound for a single LLM call of 3 validation attempts × 3 timeout attempts × 120 seconds = 1,080 seconds (18 minutes) before giving up.
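The bound can be reproduced with a few lines of arithmetic. This is a minimal sketch: `worst_case_seconds` is a hypothetical helper, not part of DocETL, and it assumes every attempt runs to the full timeout and that rate-limit retries are not counted.

```python
def worst_case_seconds(
    num_retries_on_validate_failure: int = 2,
    max_retries_per_timeout: int = 2,
    timeout_seconds: int = 120,
) -> int:
    """Upper bound on wall-clock time for one logical LLM call.

    Hypothetical helper for illustration; assumes every attempt runs
    until the timeout and ignores rate-limit retries.
    """
    validation_attempts = num_retries_on_validate_failure + 1
    timeout_attempts = max_retries_per_timeout + 1
    return validation_attempts * timeout_attempts * timeout_seconds


print(worst_case_seconds() / 60)  # 18.0 minutes with the defaults
```

Raising validation retries to 4 under the same assumptions gives 5 × 3 × 120 = 1,800 seconds (30 minutes), which is worth keeping in mind before increasing the setting.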
Usage
Use this heuristic when output quality is inconsistent or you see validation errors in logs. If the LLM consistently fails validation on the same items, increasing retries is unlikely to help; instead, simplify the output schema or improve the prompt. If failures are intermittent on operations with strict schemas (e.g., enum fields), consider increasing retries to 3-4.
The Insight (Rule of Thumb)
- Action: Set `num_retries_on_validate_failure` in operation config to control validation retries.
- Value: Default 2 retries (3 total attempts).
- Trade-off: More retries = higher LLM cost but better success rate on borderline outputs. Fewer retries = faster failure but may miss valid outputs.
- Timeout: Default 120 seconds per LLM call, with 2 timeout retries.
- Skip on Error: Set `skip_on_error: true` in operation config to skip items that fail all retries instead of halting the pipeline.
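The interaction between validation retries and `skip_on_error` can be sketched as a simple loop. This is an illustration, not DocETL's actual implementation: `run_with_validation_retries`, `call_llm`, and `validate` are hypothetical names, and only the two config keys come from the source.

```python
from typing import Any, Callable, Optional


def run_with_validation_retries(
    call_llm: Callable[[], Any],       # hypothetical: performs one LLM call
    validate: Callable[[Any], bool],   # hypothetical: True if output matches the schema
    config: dict,
) -> Optional[Any]:
    """Retry a call until its output validates or attempts run out."""
    retries = config.get("num_retries_on_validate_failure", 2)
    skip_on_error = config.get("skip_on_error", False)

    for _attempt in range(retries + 1):  # 2 retries -> 3 total attempts
        output = call_llm()
        if validate(output):
            return output

    if skip_on_error:
        return None  # caller drops this item and continues
    raise ValueError(f"output failed validation after {retries + 1} attempts")


# Example: the first two outputs are malformed, the third is valid.
outputs = iter([{"label": 7}, {}, {"label": "positive"}])
result = run_with_validation_retries(
    call_llm=lambda: next(outputs),
    validate=lambda out: out.get("label") in {"positive", "negative"},
    config={"num_retries_on_validate_failure": 2, "skip_on_error": True},
)
print(result)  # {'label': 'positive'}
```

In this sketch a skipped item is returned as `None`, so one bad item does not halt the rest of the run. How DocETL represents skipped items internally is not shown in the source.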
Reasoning
LLM outputs are probabilistic. Even well-designed prompts occasionally produce malformed outputs (missing fields, wrong types, values outside allowed ranges). The 2-retry default was chosen because:
- Retry 1: Often succeeds because the error was random (temperature-dependent)
- Retry 2: Catches cases where the first retry happened to produce the same error
- Beyond 2: Diminishing returns. If the LLM fails 3 times in a row, the issue is likely systematic (bad prompt, impossible schema) rather than random
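The diminishing-returns argument can be made concrete under a simplifying assumption that the source does not state: each attempt fails validation independently with the same probability `p`. The helper name and the values of `p` below are illustrative only, not measured failure rates.

```python
def prob_all_attempts_fail(p: float, retries: int) -> float:
    """Probability that the first call and every retry fail validation.

    Assumes independent attempts with identical failure probability p.
    """
    return p ** (retries + 1)


# Illustrative values of p only; these are not measured failure rates.
for p in (0.1, 0.9):
    for retries in range(5):
        print(f"p={p} retries={retries} -> {prob_all_attempts_fail(p, retries):.4f}")
```

Under this model, an occasional error (`p = 0.1`) drops to a 0.001 chance of total failure after 2 retries, so further retries add little. A near-systematic error (`p = 0.9`) still fails about 73% of the time after 2 retries and about 59% after 4, which is why fixing the prompt or schema is the better remedy in that case.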
The 120-second timeout accommodates large documents that require significant processing time, while the 2 timeout retries handle transient API slowdowns.
Code Evidence
Validation retry default from `docetl/operations/base.py:78-79`:
self.num_retries_on_validate_failure = self.config.get(
    "num_retries_on_validate_failure", 2
)
Timeout defaults from `docetl/operations/utils/api.py:204-205`:
timeout_seconds: int = 120,
max_retries_per_timeout: int = 2,
Skip on error option from `docetl/operations/base.py:86-89`:
class schema(BaseModel, extra="allow"):
    name: str
    type: str
    skip_on_error: bool = False