Workflow:Ucbepic Docetl Pipeline Optimization

Knowledge Sources	DocETL MOAR Optimization Guide V1 Optimization Overview DocETL Paper
Domains	LLM_Ops, Pipeline_Optimization, MCTS
Last Updated	2026-02-08 03:00 GMT

Overview

End-to-end process for automatically optimizing DocETL pipelines to improve accuracy, reduce cost, or both, using the V1 rule-based optimizer or the V2 MOAR (Multi-Objective Agentic Rewrites) optimizer.

Description

This workflow covers how to take an existing DocETL pipeline and automatically optimize it. DocETL provides two optimization approaches. The V1 optimizer uses rule-based decomposition: it analyzes map, reduce, and resolve operations, generates candidate plans (e.g., chunking strategies for long documents, blocking thresholds for entity resolution), evaluates them against sample data, and selects the best configuration. The V2 MOAR optimizer uses Monte Carlo Tree Search (MCTS) with a reasoning agent that applies 25+ directives (chaining, gleaning, model swapping, operator fusion, chunking, compression, etc.) to explore a tree of pipeline rewrites. MOAR tracks a Pareto frontier of accuracy vs. cost, producing multiple optimal pipeline configurations.

Usage

Execute this workflow when your pipeline produces results that lack depth or accuracy, when documents exceed LLM context windows, when entity resolution needs automatic threshold tuning, or when you want to minimize cost while maintaining quality. The V1 optimizer is suitable for quick improvements (chunking, threshold tuning); MOAR is suitable for comprehensive multi-objective optimization with an evaluation function.

Execution Steps

Step 1: Author Baseline Pipeline

Start with a working YAML pipeline that runs correctly but may not produce optimal results. For V1, mark the operations you want optimized by adding optimize: true to each of them. For MOAR, the pipeline needs no per-operation flags; the optimizer_config section is the only addition.

Key considerations:

The pipeline must run successfully before optimization
Mark only operations that need improvement with optimize: true (V1)
For MOAR, prepare a sample or hold-out dataset to avoid optimizing on test data
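
One way to prepare the sample mentioned above is to split the dataset once, with a fixed seed, and point the pipeline being optimized at the sample file only. The sketch below assumes the dataset is a JSON array of records. The function names, file paths, and the 30% sample fraction are illustrative and are not part of DocETL.

```python
import json
import random
from pathlib import Path


def split_dataset(records, sample_fraction=0.3, seed=42):
    """Split records into (optimization sample, hold-out) with no overlap."""
    if not 0 < sample_fraction < 1:
        raise ValueError("sample_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = max(1, int(len(shuffled) * sample_fraction))
    return shuffled[:cut], shuffled[cut:]


def write_split(dataset_path, sample_path, holdout_path, **kwargs):
    """Read a JSON array from dataset_path and write both splits as JSON files."""
    records = json.loads(Path(dataset_path).read_text())
    sample, holdout = split_dataset(records, **kwargs)
    Path(sample_path).write_text(json.dumps(sample, indent=2))
    Path(holdout_path).write_text(json.dumps(holdout, indent=2))
    return len(sample), len(holdout)
```

Keeping the hold-out file out of the optimization run means the final comparison between baseline and optimized pipelines is made on data the optimizer never saw.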

Step 2: Write Evaluation Function (MOAR only)

For MOAR optimization, create a Python file with an evaluation function decorated with @register_eval. This function receives the original dataset file path and the pipeline output (results) file path, in that order, computes evaluation metrics (precision, recall, accuracy scores, etc.), and returns a dictionary of named metrics. The metric_key setting in optimizer_config selects which of the returned metrics the optimizer maximizes.

Key considerations:

The function must take exactly two arguments: dataset_file_path and results_file_path
Return a dictionary of numeric metrics
The metric_key in optimizer_config must match a key in the returned dictionary
Only one function per file can be decorated with @register_eval
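
A minimal sketch of such an evaluation file follows. It assumes the decorator is importable from docetl.utils_evaluation, that both files are JSON arrays, and that records carry the hypothetical fields id, label (ground truth), and predicted_label (pipeline output). The fallback decorator exists only so the sketch runs without DocETL installed.

```python
import json

try:
    # Assumed import path; check it against the MOAR guide for your DocETL version.
    from docetl.utils_evaluation import register_eval
except ImportError:
    def register_eval(func):
        """Stand-in decorator so this sketch runs without DocETL installed."""
        return func


@register_eval
def evaluate_results(dataset_file_path, results_file_path):
    """Compare predicted labels with ground truth and return named metrics."""
    with open(dataset_file_path) as f:
        dataset = json.load(f)
    with open(results_file_path) as f:
        results = json.load(f)

    # Hypothetical schema: "id" joins the two files, "label" holds the truth,
    # "predicted_label" holds the pipeline's answer.
    truth = {row["id"]: row["label"] for row in dataset}
    if not truth:
        return {"accuracy": 0.0, "coverage": 0.0}

    correct = sum(
        1 for row in results
        if row["id"] in truth and row.get("predicted_label") == truth[row["id"]]
    )
    covered = {row["id"] for row in results} & truth.keys()
    return {
        "accuracy": correct / len(truth),   # candidate value for metric_key
        "coverage": len(covered) / len(truth),
    }
```

With this file, setting metric_key to accuracy would satisfy the requirement that the key appear in the returned dictionary.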

Step 3: Configure Optimizer

For V1, simply run the build command; the optimizer reads optimize: true flags from operations. For MOAR, add an optimizer_config section to the YAML specifying the optimizer type, save directory, available models, evaluation file, metric key, maximum iterations, and the rewrite agent model.

Key considerations:

V1 requires no additional configuration beyond optimize flags
MOAR's available_models list determines which LLM models the optimizer can try
max_iterations sets the number of MCTS iterations, which bounds how much of the rewrite tree is explored (more iterations tend to yield better results but cost more)
The rewrite_agent_model should be a capable model (e.g., gpt-4o or gpt-5.1) for generating rewrite strategies
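
The MOAR settings listed above can be checked before an expensive run. The sketch below mirrors the optimizer_config section as a Python dict and validates it. The field names type, save_dir, and evaluation_file are assumptions inferred from the description (only available_models, metric_key, max_iterations, and rewrite_agent_model are named in this workflow), and every value is illustrative.

```python
REQUIRED_FIELDS = (
    "type", "save_dir", "available_models", "evaluation_file",
    "metric_key", "max_iterations", "rewrite_agent_model",
)

# Mirrors the optimizer_config section of the pipeline YAML (values are examples).
optimizer_config = {
    "type": "moar",                      # assumed field name
    "save_dir": "results/moar",          # assumed field name, hypothetical path
    "available_models": ["gpt-4o-mini", "gpt-4o"],
    "evaluation_file": "evaluate.py",    # assumed field name, hypothetical file
    "metric_key": "accuracy",
    "max_iterations": 20,
    "rewrite_agent_model": "gpt-4o",
}


def check_optimizer_config(config, metric_names):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    if "available_models" in config and not config["available_models"]:
        problems.append("available_models must list at least one model")
    iterations = config.get("max_iterations")
    if iterations is not None and (not isinstance(iterations, int) or iterations < 1):
        problems.append("max_iterations must be a positive integer")
    key = config.get("metric_key")
    if key is not None and key not in metric_names:
        problems.append(f"metric_key '{key}' is not returned by the evaluation function")
    return problems
```

Passing the keys of a dictionary returned by the evaluation function as metric_names catches a mismatched metric_key before any optimization budget is spent.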

Step 4: Run Optimization

Execute the build command via CLI (docetl build pipeline.yaml for V1, or docetl build pipeline.yaml --optimizer moar for MOAR). The optimizer explores candidate configurations, evaluates them, and writes the optimized pipeline to a new YAML file. V1 produces a single optimized pipeline; MOAR produces a Pareto frontier of solutions trading off accuracy and cost.

Key considerations:

V1 optimization cost depends on the number of candidate plans evaluated (can cost $20+ for complex operations)
MOAR optimization cost depends on max_iterations and the models used (can cost significantly more)
Both optimizers support resuming from interrupted runs
Results are saved to the configured save directory
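
When the build step is launched from a script rather than a shell, the two commands quoted above can be assembled as follows. This is a sketch: the helper names are hypothetical, and only the docetl build invocation and the --optimizer moar flag come from this workflow.

```python
import subprocess


def build_command(pipeline_path, use_moar=False):
    """Return the argv list for `docetl build`, adding the MOAR flag on request."""
    argv = ["docetl", "build", pipeline_path]
    if use_moar:
        argv += ["--optimizer", "moar"]
    return argv


def run_build(pipeline_path, use_moar=False):
    """Run the optimizer; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pipeline_path, use_moar), check=True)
```

Passing the command as a list avoids shell quoting problems when the pipeline path contains spaces.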

Step 5: Review and Run Optimized Pipeline

Inspect the generated optimized YAML file(s). The optimizer may have added split/gather operations for chunking, adjusted blocking thresholds, added gleaning rounds, changed models, or restructured operations. Run the optimized pipeline with docetl run and compare output quality and cost against the baseline.

Key considerations:

MOAR produces multiple pipeline variants on the Pareto frontier; choose based on your accuracy/cost preference
Review any synthesized prompts (validation prompts, comparison prompts) and edit them if needed
The experiment_summary.json (MOAR) provides a high-level summary of the optimization run
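
Choosing among Pareto-frontier variants can be made concrete with a small sketch. A variant is on the frontier when no other variant is at least as accurate and at least as cheap while being strictly better on one of the two. The record layout and the numbers below are hypothetical; this is not the schema of experiment_summary.json.

```python
def pareto_frontier(candidates):
    """Keep candidates that no other candidate beats on both accuracy and cost."""
    def dominates(a, b):
        return (a["accuracy"] >= b["accuracy"] and a["cost"] <= b["cost"]
                and (a["accuracy"] > b["accuracy"] or a["cost"] < b["cost"]))
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def pick_within_budget(frontier, max_cost):
    """Return the most accurate frontier candidate whose cost fits the budget."""
    affordable = [c for c in frontier if c["cost"] <= max_cost]
    return max(affordable, key=lambda c: c["accuracy"]) if affordable else None


# Made-up variants for illustration only.
candidates = [
    {"name": "baseline", "accuracy": 0.70, "cost": 1.00},
    {"name": "cheaper_model", "accuracy": 0.68, "cost": 0.40},
    {"name": "gleaning", "accuracy": 0.82, "cost": 2.50},
    {"name": "chunked", "accuracy": 0.65, "cost": 1.20},  # dominated by baseline
]

frontier = pareto_frontier(candidates)
choice = pick_within_budget(frontier, max_cost=1.50)
```

A budget cap is only one selection rule; a minimum-accuracy threshold with the cheapest qualifying variant works the same way with the roles of the two metrics swapped.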
