# Heuristic: Ucbepic Docetl Reduce Parallel Fold Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Pipelines |
| Last Updated | 2026-02-08 01:00 GMT |
## Overview
Dynamic parallelism tuning for reduce operations using observed fold/merge timing to balance parallel fold streams against merge overhead.
## Description
The reduce operation processes groups of items by iteratively folding items into an accumulator, then merging multiple fold results. The number of parallel fold streams is calculated dynamically based on observed timing data, using the formula:
`num_parallel_folds = (fold_time * num_items * log(merge_batch_size)) / (fold_batch_size * merge_time)`
This ensures that fold processing and merge processing take roughly equal time, preventing either phase from becoming a bottleneck. The system requires a minimum of 5 timing samples before adjusting parallelism, and keeps a circular buffer of up to 1000 samples.
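The formula can be evaluated directly. A minimal sketch, using hypothetical timings and batch sizes (none of these numbers come from DocETL itself):

```python
import math

# Hypothetical observed timings (seconds) and batch sizes, for illustration only.
fold_time = 4.0        # average time per fold call
merge_time = 2.0       # average time per merge call
num_items = 200        # items in the group being reduced
fold_batch_size = 10   # items folded per LLM call
merge_batch_size = 4   # fold results combined per merge call

# Balance fold work against merge work; never drop below one fold stream.
num_parallel_folds = max(
    1,
    int(
        (fold_time * num_items * math.log(merge_batch_size))
        / (fold_batch_size * merge_time)
    ),
)
print(num_parallel_folds)  # 55 for these inputs
```

With these inputs the fold phase is twice as expensive per call as the merge phase, so the formula spreads the group across many parallel fold streams.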
## Usage
Use this heuristic when tuning reduce operation performance or debugging slow reduce operations. If reduce operations are slow, check whether the fold or merge phase is the bottleneck. The system auto-tunes, but extreme imbalances (very fast folds with slow merges, or vice versa) may indicate a prompt design issue.
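When debugging, a quick comparison of average fold and merge times can flag the imbalance described above. This is a hypothetical diagnostic helper, not part of DocETL; the function name and the 10x threshold are illustrative assumptions:

```python
from statistics import mean

def diagnose_reduce_balance(fold_times, merge_times, ratio_threshold=10.0):
    """Flag extreme fold/merge imbalance that may signal a prompt design issue.

    fold_times / merge_times: per-call durations in seconds (hypothetical samples).
    """
    if not fold_times or not merge_times:
        return "insufficient samples"
    ratio = mean(fold_times) / mean(merge_times)
    if ratio > ratio_threshold:
        return "fold-bound: consider simplifying the fold prompt"
    if ratio < 1.0 / ratio_threshold:
        return "merge-bound: consider simplifying the merge prompt"
    return "balanced"
```

A ratio far from 1.0 in either direction suggests one phase's prompt is doing disproportionate work.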
## The Insight (Rule of Thumb)
- Action: The reduce operation automatically tunes its parallelism level based on measured fold and merge times.
- Value:
  - Minimum samples before tuning: 5
  - Maximum samples tracked: 1000 (circular buffer)
  - Formula: `max(1, fold_time * group_size * log(merge_batch) / (fold_batch * merge_time))`
- Trade-off: More parallel folds = faster processing of large groups, but more merge work. Fewer parallel folds = less merge overhead, but slower fold phase.
- Key insight: The logarithmic factor (`log(merge_batch_size)`) accounts for the fact that merging cost grows sub-linearly with batch size.
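The effect of the logarithmic factor can be seen by sweeping `merge_batch_size` while holding the other inputs fixed. The timings below are hypothetical, chosen only to make the sub-linear growth visible:

```python
import math

fold_time, merge_time = 3.0, 1.5   # hypothetical average timings (seconds)
num_items, fold_batch = 100, 5     # hypothetical group size and fold batch size

# Doubling merge_batch adds a constant log-increment, so parallelism
# grows sub-linearly rather than doubling with it.
parallelism = {
    merge_batch: max(1, int(fold_time * num_items * math.log(merge_batch)
                            / (fold_batch * merge_time)))
    for merge_batch in (2, 4, 8, 16)
}
print(parallelism)  # {2: 27, 4: 55, 8: 83, 16: 110}
```

Each doubling of the merge batch adds roughly the same absolute increment of parallel folds, rather than multiplying it.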
## Reasoning
The reduce operation has a fundamental tension: folding items one-by-one is slow but produces a single result. Folding in parallel is fast but produces multiple results that must be merged. The optimal parallelism depends on the relative cost of fold vs. merge operations, which varies by:
- Prompt complexity: Some fold prompts are much more expensive than their merge counterparts
- Data size: Larger items take longer to fold
- Model speed: Different LLM models have different latencies
Rather than requiring manual tuning, the system observes actual timing and adapts. The 5-sample minimum prevents premature optimization from noisy initial measurements.
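The sample-gated estimation can be sketched as follows. The two-value return shape mirrors the `get_fold_time()` call in the code below; the fallback constant and function body here are illustrative assumptions, not DocETL's actual implementation:

```python
from collections import deque
from statistics import mean

MIN_SAMPLES, MAX_SAMPLES = 5, 1000
fold_times = deque(maxlen=MAX_SAMPLES)  # circular buffer: oldest samples age out

DEFAULT_FOLD_TIME = 2.0  # hypothetical fallback estimate (seconds)

def get_fold_time():
    """Return (estimate, used_default).

    Until MIN_SAMPLES observations exist, fall back to a default rather than
    trusting a noisy mean of one or two measurements.
    """
    if len(fold_times) >= MIN_SAMPLES:
        return mean(fold_times), False
    return DEFAULT_FOLD_TIME, True
```

The boolean lets the caller know whether the parallelism it computed rests on real measurements or on defaults.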
## Code Evidence
Timing collection initialization from `docetl/operations/reduce.py:189-192`:
```python
self.min_samples = 5
self.max_samples = 1000
self.fold_times = deque(maxlen=self.max_samples)
self.merge_times = deque(maxlen=self.max_samples)
```
Dynamic parallelism calculation from `docetl/operations/reduce.py:536-549`:
```python
def calculate_num_parallel_folds():
    fold_time, fold_default = self.get_fold_time()
    merge_time, merge_default = self.get_merge_time()
    num_group_items = len(group_list)
    return (
        max(
            1,
            int(
                (fold_time * num_group_items * math.log(merge_batch_size))
                / (fold_batch_size * merge_time)
            ),
        ),
        fold_default or merge_default,
    )
```