# Heuristic: Ucbepic Docetl Reduce Parallel Fold Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Pipelines |
| Last Updated | 2026-02-08 01:00 GMT |
## Overview
Dynamic parallelism tuning for reduce operations using observed fold/merge timing to balance parallel fold streams against merge overhead.
## Description
The reduce operation processes groups of items by iteratively folding items into an accumulator, then merging multiple fold results. The number of parallel fold streams is calculated dynamically based on observed timing data, using the formula:
`num_parallel_folds = (fold_time * num_items * log(merge_batch_size)) / (fold_batch_size * merge_time)`
This ensures that fold processing and merge processing take roughly equal time, preventing either phase from becoming a bottleneck. The system requires a minimum of 5 timing samples before adjusting parallelism, and keeps a circular buffer of up to 1000 samples.
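The formula can be evaluated directly. A minimal sketch, using hypothetical timings and batch sizes (none of these numbers come from DocETL itself):

```python
import math

# Hypothetical observed timings (seconds) and batch sizes, for illustration only.
fold_time = 4.0        # average time per fold call
merge_time = 2.0       # average time per merge call
num_items = 200        # items in the group being reduced
fold_batch_size = 10   # items folded per LLM call
merge_batch_size = 4   # fold results combined per merge call

# Balance fold work against merge work; never drop below one fold stream.
num_parallel_folds = max(
    1,
    int(
        (fold_time * num_items * math.log(merge_batch_size))
        / (fold_batch_size * merge_time)
    ),
)
print(num_parallel_folds)  # 55 for these inputs
```

With these inputs the fold phase is twice as expensive per call as the merge phase, so the formula spreads the group across many parallel fold streams.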
## Usage
Use this heuristic when tuning reduce operation performance or debugging slow reduce operations. If reduce operations are slow, check whether the fold or merge phase is the bottleneck. The system auto-tunes, but extreme imbalances (very fast folds with slow merges, or vice versa) may indicate a prompt design issue.
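When debugging, a quick comparison of average fold and merge times can flag the imbalance described above. This is a hypothetical diagnostic helper, not part of DocETL; the function name and the 10x threshold are illustrative assumptions:

```python
from statistics import mean

def diagnose_reduce_balance(fold_times, merge_times, ratio_threshold=10.0):
    """Flag extreme fold/merge imbalance that may signal a prompt design issue.

    fold_times / merge_times: per-call durations in seconds (hypothetical samples).
    """
    if not fold_times or not merge_times:
        return "insufficient samples"
    ratio = mean(fold_times) / mean(merge_times)
    if ratio > ratio_threshold:
        return "fold-bound: consider simplifying the fold prompt"
    if ratio < 1.0 / ratio_threshold:
        return "merge-bound: consider simplifying the merge prompt"
    return "balanced"
```

A ratio far from 1.0 in either direction suggests one phase's prompt is doing disproportionate work.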
## The Insight (Rule of Thumb)
- Action: The reduce operation automatically tunes its parallelism level based on measured fold and merge times.
- Value:
  - Minimum samples before tuning: 5
  - Maximum samples tracked: 1000 (circular buffer)
  - Formula: `max(1, fold_time * group_size * log(merge_batch) / (fold_batch * merge_time))`
- Trade-off: More parallel folds = faster processing of large groups, but more merge work. Fewer parallel folds = less merge overhead, but slower fold phase.
- Key insight: The logarithmic factor (`log(merge_batch_size)`) accounts for the fact that merging cost grows sub-linearly with batch size.
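The effect of the logarithmic factor can be seen by sweeping `merge_batch_size` while holding the other inputs fixed. The timings below are hypothetical, chosen only to make the sub-linear growth visible:

```python
import math

fold_time, merge_time = 3.0, 1.5   # hypothetical average timings (seconds)
num_items, fold_batch = 100, 5     # hypothetical group size and fold batch size

# Doubling merge_batch adds a constant log-increment, so parallelism
# grows sub-linearly rather than doubling with it.
parallelism = {
    merge_batch: max(1, int(fold_time * num_items * math.log(merge_batch)
                            / (fold_batch * merge_time)))
    for merge_batch in (2, 4, 8, 16)
}
print(parallelism)  # {2: 27, 4: 55, 8: 83, 16: 110}
```

Each doubling of the merge batch adds roughly the same absolute increment of parallel folds, rather than multiplying it.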
## Reasoning
The reduce operation has a fundamental tension: folding items one-by-one is slow but produces a single result. Folding in parallel is fast but produces multiple results that must be merged. The optimal parallelism depends on the relative cost of fold vs. merge operations, which varies by:
- Prompt complexity: Some fold prompts are much more expensive than their merge counterparts
- Data size: Larger items take longer to fold
- Model speed: Different LLM models have different latencies
Rather than requiring manual tuning, the system observes actual timing and adapts. The 5-sample minimum prevents premature optimization from noisy initial measurements.
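The sample-gated estimation can be sketched as follows. The two-value return shape mirrors the `get_fold_time()` call in the code below; the fallback constant and function body here are illustrative assumptions, not DocETL's actual implementation:

```python
from collections import deque
from statistics import mean

MIN_SAMPLES, MAX_SAMPLES = 5, 1000
fold_times = deque(maxlen=MAX_SAMPLES)  # circular buffer: oldest samples age out

DEFAULT_FOLD_TIME = 2.0  # hypothetical fallback estimate (seconds)

def get_fold_time():
    """Return (estimate, used_default).

    Until MIN_SAMPLES observations exist, fall back to a default rather than
    trusting a noisy mean of one or two measurements.
    """
    if len(fold_times) >= MIN_SAMPLES:
        return mean(fold_times), False
    return DEFAULT_FOLD_TIME, True
```

The boolean lets the caller know whether the parallelism it computed rests on real measurements or on defaults.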
## Code Evidence
Timing collection initialization from `docetl/operations/reduce.py:189-192`:
```python
self.min_samples = 5
self.max_samples = 1000
self.fold_times = deque(maxlen=self.max_samples)
self.merge_times = deque(maxlen=self.max_samples)
```
Dynamic parallelism calculation from `docetl/operations/reduce.py:536-549`:
```python
def calculate_num_parallel_folds():
    fold_time, fold_default = self.get_fold_time()
    merge_time, merge_default = self.get_merge_time()
    num_group_items = len(group_list)
    return (
        max(
            1,
            int(
                (fold_time * num_group_items * math.log(merge_batch_size))
                / (fold_batch_size * merge_time)
            ),
        ),
        fold_default or merge_default,
    )
```