# Heuristic: Ucbepic Docetl Optimizer Sample Sizes
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Pipelines |
| Last Updated | 2026-02-08 01:00 GMT |
## Overview
Operation-specific sample size defaults (5-100 items) used by the optimizer to evaluate pipeline performance before rewriting operations.
## Description
When the DocETL optimizer analyzes a pipeline, it samples a subset of the input data to evaluate each operation's performance. Different operation types use different sample sizes based on how much data is needed to characterize their behavior. Simple operations (map, filter) need only 5 items, while complex multi-item operations (resolve, equijoin) need 100.
This is a critical cost-control mechanism: running the optimizer on full datasets would be prohibitively expensive, so these sample sizes represent the minimum data needed for meaningful optimization decisions.
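To make the cost argument concrete, here is a back-of-envelope comparison of per-operation sample cost versus full-dataset cost. The dataset size and per-call price are illustrative assumptions, not DocETL measurements:

```python
# Back-of-envelope: evaluating an operation on its sample vs. the full input.
# All numbers below are illustrative assumptions.
full_dataset = 10_000   # documents in the pipeline input (assumed)
cost_per_call = 0.002   # assumed dollars per LLM call

for op, sample in [("map", 5), ("reduce", 40), ("resolve", 100)]:
    sampled = sample * cost_per_call
    full = full_dataset * cost_per_call
    print(f"{op}: ${sampled:.2f} sampled vs ${full:.2f} on the full dataset")
```

Even for the largest sample (`resolve` at 100 items), the optimizer evaluates on a small fraction of what a full-dataset run would cost.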
## Usage
Use this heuristic when configuring the optimizer or troubleshooting optimization quality. If the optimizer produces poor recommendations, increasing the sample size for the relevant operation type may help. Conversely, reducing sample sizes can speed up optimization at the cost of accuracy.
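A minimal sketch of per-operation sampling, assuming a uniform random draw. The `scale_factor` knob, the fixed seed, and the fallback size of 10 for unknown operation types are assumptions for illustration, not DocETL behavior:

```python
import random

# Defaults mirror the SAMPLE_SIZE_MAP values documented in this note.
SAMPLE_SIZE_MAP = {
    "map": 5, "filter": 5,
    "split": 10, "gather": 10, "unnest": 10,
    "reduce": 40,
    "resolve": 100, "equijoin": 100,
}

def sample_for_op(op_type, data, scale_factor=1.0, seed=42):
    """Draw an evaluation sample for one operation type."""
    base = SAMPLE_SIZE_MAP.get(op_type, 10)  # assumed fallback for unknown types
    # Clamp to the dataset size so small inputs don't raise an error.
    n = min(len(data), max(1, int(base * scale_factor)))
    # A fixed seed keeps repeated optimizer runs comparable.
    return random.Random(seed).sample(data, n)
```

Raising `scale_factor` above 1.0 models the "increase the sample size" remedy above; lowering it models trading accuracy for speed.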
## The Insight (Rule of Thumb)
- Action: The optimizer uses hard-coded sample sizes per operation type, defined in `SAMPLE_SIZE_MAP`.
- Value:
- `map`: 5 items
- `filter`: 5 items
- `split`: 10 items
- `gather`: 10 items
- `unnest`: 10 items
- `reduce`: 40 items
- `resolve`: 100 items
- `equijoin`: 100 items
- Trade-off: Smaller samples = faster/cheaper optimization but may miss edge cases. Larger samples = more accurate but higher LLM cost.
- Override: Sample sizes can be overridden per-operation in the pipeline YAML config via the `optimize` flag or `optimizer_config`.
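The override can be sketched as a pipeline fragment. Note that the `optimizer_config` layout and the `sample_size` key shown here are assumptions for illustration, not confirmed DocETL syntax:

```yaml
operations:
  - name: dedupe_entities   # hypothetical operation name
    type: resolve
    optimize: true          # opt this operation into optimization
optimizer_config:
  resolve:
    sample_size: 200        # hypothetical key: raise the 100-item default
```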
## Reasoning
The sample sizes reflect the inherent complexity of each operation:
- Map/Filter (5): These operate on individual documents independently. A small sample is sufficient to evaluate prompt quality since each document is processed in isolation.
- Split/Gather/Unnest (10): These involve structural transformations. Slightly more samples help capture variation in document structure (e.g., different section counts, nesting depths).
- Reduce (40): Reduce aggregates multiple items into one, so the optimizer needs enough groups of items to evaluate how well the fold/merge prompts work across different group sizes and compositions.
- Resolve/Equijoin (100): These involve pairwise comparisons (O(n^2) potential pairs). 100 items provide enough diversity to find meaningful blocking thresholds and comparison prompts. The optimizer also uses these samples for automatic threshold calibration.
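The quadratic growth behind the resolve/equijoin numbers is easy to check: n items yield n(n-1)/2 unordered candidate pairs.

```python
from math import comb

# Candidate comparisons grow quadratically with sample size, which is
# why pairwise operations need far larger samples than per-item ones.
for n in (5, 10, 40, 100):
    print(f"{n} items -> {comb(n, 2)} candidate pairs")
```

A 5-item map sample implies 5 prompt evaluations, but a 100-item resolve sample exposes the optimizer to 4,950 potential comparisons, enough diversity to calibrate blocking thresholds.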
## Code Evidence
From `docetl/optimizer.py:36-45`:
```python
SAMPLE_SIZE_MAP = {
    "reduce": 40,
    "map": 5,
    "resolve": 100,
    "equijoin": 100,
    "filter": 5,
    "split": 10,
    "gather": 10,
    "unnest": 10,
}
```