Heuristic:Ucbepic Docetl Validation Retry Strategy

Knowledge Sources	DocETL Internal
Domains	LLM_Pipelines, Debugging
Last Updated	2026-02-08 01:00 GMT

Overview

By default, DocETL retries an LLM call 2 times (3 total attempts) when the output fails schema validation, with a configurable timeout of 120 seconds per call.

Description

LLM outputs are not guaranteed to match the expected schema. When an output fails validation (e.g., wrong type, missing field, value not in allowed set), DocETL retries the LLM call up to 2 additional times. This retry is separate from rate-limit retries and timeout retries:

Validation retries: 2 retries (configurable via `num_retries_on_validate_failure`)
Timeout: 120 seconds per LLM call (configurable via `timeout`)
Timeout retries: 2 retries per timeout (configurable via `max_retries_per_timeout`)

Together, the worst case for a single LLM call, where every validation attempt also exhausts all of its timeout attempts, is: 3 validation attempts * 3 timeout attempts * 120 seconds = 1,080 seconds (18 minutes) before giving up.
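The worst-case figure can be recomputed for other settings with a few lines of Python. This is only a sketch of the arithmetic, assuming that retries compound multiplicatively and that every attempt runs to the full timeout; the function name `worst_case_seconds` is hypothetical and not part of DocETL.

```python
# Hypothetical helper (not a DocETL function): worst-case wall-clock time for
# one logical LLM call, assuming every validation attempt exhausts all of its
# timeout attempts and each attempt runs for the full timeout.
def worst_case_seconds(validation_retries=2, timeout_retries=2, timeout_seconds=120):
    validation_attempts = validation_retries + 1  # first try + retries
    timeout_attempts = timeout_retries + 1        # first try + retries
    return validation_attempts * timeout_attempts * timeout_seconds


default_worst_case = worst_case_seconds()
print(default_worst_case, "seconds =", default_worst_case / 60, "minutes")

# Raising validation retries to 4 lengthens the worst case accordingly.
print(worst_case_seconds(validation_retries=4) / 60, "minutes")
```

Under the same assumptions, each extra validation retry adds up to 6 minutes of worst-case latency per item at the default timeout settings.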

Usage

Use this heuristic when output quality is inconsistent or you see validation errors in logs. If the LLM consistently fails validation on certain items, increasing retries is unlikely to help; instead, simplify the output schema or improve the prompt. For operations with strict schemas (e.g., enum fields) where failures are intermittent rather than consistent, consider increasing retries to 3-4.

The Insight (Rule of Thumb)

Action: Set `num_retries_on_validate_failure` in operation config to control validation retries.
Value: Default 2 retries (3 total attempts).
Trade-off: More retries = higher LLM cost but better success rate on borderline outputs. Fewer retries = faster failure but may miss valid outputs.
Timeout: Default 120 seconds per LLM call, with 2 timeout retries.
Skip on Error: Set `skip_on_error: true` in operation config to skip items that fail all retries instead of halting the pipeline.
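The interaction between `num_retries_on_validate_failure` and `skip_on_error` can be illustrated with a minimal retry loop. This is a sketch under stated assumptions, not DocETL's actual implementation: `run_with_validation_retries`, `call_llm`, and `is_valid` are hypothetical names, the LLM is replaced by a canned list of responses, and raising an exception stands in for halting the pipeline.

```python
from typing import Callable, Optional


def run_with_validation_retries(
    call_llm: Callable[[], dict],      # hypothetical: performs one LLM call
    is_valid: Callable[[dict], bool],  # hypothetical: schema validation check
    config: dict,
) -> Optional[dict]:
    # Same config keys and defaults as described above.
    num_retries = config.get("num_retries_on_validate_failure", 2)
    skip_on_error = config.get("skip_on_error", False)

    # One initial attempt plus num_retries retries.
    for _ in range(num_retries + 1):
        output = call_llm()
        if is_valid(output):
            return output

    if skip_on_error:
        return None  # the item is skipped; the caller moves on
    # Assumption: an exception stands in for "halting the pipeline".
    raise ValueError(f"output failed validation after {num_retries + 1} attempts")


# Simulated LLM: two invalid enum values, then a valid one.
allowed = {"positive", "negative", "neutral"}
responses = iter([{"sentiment": "meh"}, {"sentiment": 5}, {"sentiment": "positive"}])

result = run_with_validation_retries(
    call_llm=lambda: next(responses),
    is_valid=lambda out: out.get("sentiment") in allowed,
    config={},  # defaults: 2 retries, skip_on_error off
)
print(result)
```

In this simulation the third attempt is the last one the default allows, so a fourth malformed response would either raise or, with `skip_on_error` set to true, yield `None` for that item.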

Reasoning

LLM outputs are probabilistic. Even well-designed prompts occasionally produce malformed outputs (missing fields, wrong types, values outside allowed ranges). The 2-retry default was chosen because:

Retry 1: Often succeeds because the error was random (temperature-dependent)
Retry 2: Catches cases where the first retry happened to produce the same error
Beyond 2: Diminishing returns — if the LLM fails 3 times, the issue is likely systematic (bad prompt, impossible schema) rather than random

The 120-second timeout accommodates large documents that require significant processing time, while the 2 timeout retries handle transient API slowdowns.
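The diminishing-returns argument can be made concrete with a simple model. The sketch below assumes each attempt fails independently with a fixed probability, which is an illustrative simplification and not something the DocETL source states; the function name `success_within` and the failure rates are hypothetical.

```python
# Illustrative model only: assumes attempts fail independently with a fixed
# probability p_fail, so P(at least one success in n attempts) = 1 - p_fail**n.
def success_within(attempts: int, p_fail: float) -> float:
    return 1.0 - p_fail ** attempts


# A "random" failure mode (20% per attempt) versus a near-systematic one (90%).
for p_fail in (0.2, 0.9):
    for attempts in (1, 2, 3, 4, 5):
        rate = success_within(attempts, p_fail)
        print(f"p_fail={p_fail:.1f} attempts={attempts} success={rate:.3f}")
```

Under this model, a low per-attempt failure rate is almost fully absorbed by 3 attempts, while a high failure rate stays costly even with extra retries, which matches the advice to fix the prompt or schema when failures are systematic.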

Code Evidence

Validation retry default from `docetl/operations/base.py:78-79`:

self.num_retries_on_validate_failure = self.config.get(
    "num_retries_on_validate_failure", 2
)

Timeout defaults from `docetl/operations/utils/api.py:204-205`:

timeout_seconds: int = 120,
max_retries_per_timeout: int = 2,

Skip on error option from `docetl/operations/base.py:86-89`:

class schema(BaseModel, extra="allow"):
    name: str
    type: str
    skip_on_error: bool = False
