# Heuristic: Ucbepic Docetl Optimizer Sample Sizes
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Pipelines |
| Last Updated | 2026-02-08 01:00 GMT |
## Overview
Operation-specific sample size defaults (5-100 items) used by the optimizer to evaluate pipeline performance before rewriting operations.
## Description
When the DocETL optimizer analyzes a pipeline, it samples a subset of the input data to evaluate each operation's performance. Different operation types use different sample sizes based on how much data is needed to characterize their behavior. Simple operations (map, filter) need only 5 items, while complex multi-item operations (resolve, equijoin) need 100.
This is a critical cost-control mechanism: running the optimizer on full datasets would be prohibitively expensive, so these sample sizes represent the minimum data needed for meaningful optimization decisions.
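To make the cost argument concrete, here is a back-of-envelope comparison of per-operation sample cost versus full-dataset cost. The dataset size and per-call price are illustrative assumptions, not DocETL measurements:

```python
# Back-of-envelope: evaluating an operation on its sample vs. the full input.
# All numbers below are illustrative assumptions.
full_dataset = 10_000   # documents in the pipeline input (assumed)
cost_per_call = 0.002   # assumed dollars per LLM call

for op, sample in [("map", 5), ("reduce", 40), ("resolve", 100)]:
    sampled = sample * cost_per_call
    full = full_dataset * cost_per_call
    print(f"{op}: ${sampled:.2f} sampled vs ${full:.2f} on the full dataset")
```

Even for the largest sample (`resolve` at 100 items), the optimizer evaluates on a small fraction of what a full-dataset run would cost.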
## Usage
Use this heuristic when configuring the optimizer or troubleshooting optimization quality. If the optimizer produces poor recommendations, increasing the sample size for the relevant operation type may help. Conversely, reducing sample sizes can speed up optimization at the cost of accuracy.
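A minimal sketch of per-operation sampling, assuming a uniform random draw. The `scale_factor` knob, the fixed seed, and the fallback size of 10 for unknown operation types are assumptions for illustration, not DocETL behavior:

```python
import random

# Defaults mirror the SAMPLE_SIZE_MAP values documented in this note.
SAMPLE_SIZE_MAP = {
    "map": 5, "filter": 5,
    "split": 10, "gather": 10, "unnest": 10,
    "reduce": 40,
    "resolve": 100, "equijoin": 100,
}

def sample_for_op(op_type, data, scale_factor=1.0, seed=42):
    """Draw an evaluation sample for one operation type."""
    base = SAMPLE_SIZE_MAP.get(op_type, 10)  # assumed fallback for unknown types
    # Clamp to the dataset size so small inputs don't raise an error.
    n = min(len(data), max(1, int(base * scale_factor)))
    # A fixed seed keeps repeated optimizer runs comparable.
    return random.Random(seed).sample(data, n)
```

Raising `scale_factor` above 1.0 models the "increase the sample size" remedy above; lowering it models trading accuracy for speed.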
## The Insight (Rule of Thumb)
- Action: The optimizer uses hard-coded sample sizes per operation type, defined in `SAMPLE_SIZE_MAP`.
- Value:
- `map`: 5 items
- `filter`: 5 items
- `split`: 10 items
- `gather`: 10 items
- `unnest`: 10 items
- `reduce`: 40 items
- `resolve`: 100 items
- `equijoin`: 100 items
- Trade-off: Smaller samples = faster/cheaper optimization but may miss edge cases. Larger samples = more accurate but higher LLM cost.
- Override: Sample sizes can be overridden per-operation in the pipeline YAML config via the `optimize` flag or `optimizer_config`.
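The override can be sketched as a pipeline fragment. Note that the `optimizer_config` layout and the `sample_size` key shown here are assumptions for illustration, not confirmed DocETL syntax:

```yaml
operations:
  - name: dedupe_entities   # hypothetical operation name
    type: resolve
    optimize: true          # opt this operation into optimization
optimizer_config:
  resolve:
    sample_size: 200        # hypothetical key: raise the 100-item default
```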
## Reasoning
The sample sizes reflect the inherent complexity of each operation:
- Map/Filter (5): These operate on individual documents independently. A small sample is sufficient to evaluate prompt quality since each document is processed in isolation.
- Split/Gather/Unnest (10): These involve structural transformations. Slightly more samples help capture variation in document structure (e.g., different section counts, nesting depths).
- Reduce (40): Reduce aggregates multiple items into one, so the optimizer needs enough groups of items to evaluate how well the fold/merge prompts work across different group sizes and compositions.
- Resolve/Equijoin (100): These involve pairwise comparisons (O(n^2) potential pairs). 100 items provide enough diversity to find meaningful blocking thresholds and comparison prompts. The optimizer also uses these samples for automatic threshold calibration.
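The quadratic growth behind the resolve/equijoin numbers is easy to check: n items yield n(n-1)/2 unordered candidate pairs.

```python
from math import comb

# Candidate comparisons grow quadratically with sample size, which is
# why pairwise operations need far larger samples than per-item ones.
for n in (5, 10, 40, 100):
    print(f"{n} items -> {comb(n, 2)} candidate pairs")
```

A 5-item map sample implies 5 prompt evaluations, but a 100-item resolve sample exposes the optimizer to 4,950 potential comparisons, enough diversity to calibrate blocking thresholds.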
## Code Evidence
From `docetl/optimizer.py:36-45`:
```python
SAMPLE_SIZE_MAP = {
    "reduce": 40,
    "map": 5,
    "resolve": 100,
    "equijoin": 100,
    "filter": 5,
    "split": 10,
    "gather": 10,
    "unnest": 10,
}
```