Workflow:Ucbepic Docetl Pipeline Optimization
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Pipeline_Optimization, MCTS |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
End-to-end process for automatically optimizing DocETL pipelines to improve accuracy, reduce cost, or both, using the V1 rule-based optimizer or the V2 MOAR (Multi-Objective Agentic Rewrites) optimizer.
Description
This workflow covers how to take an existing DocETL pipeline and automatically optimize it. DocETL provides two optimization approaches. The V1 optimizer uses rule-based decomposition: it analyzes map, reduce, and resolve operations, generates candidate plans (e.g., chunking strategies for long documents, blocking thresholds for entity resolution), evaluates them against sample data, and selects the best configuration. The V2 MOAR optimizer uses Monte Carlo Tree Search (MCTS) with a reasoning agent that applies 25+ directives (chaining, gleaning, model swapping, operator fusion, chunking, compression, etc.) to explore a tree of pipeline rewrites. MOAR tracks a Pareto frontier of accuracy vs. cost, producing multiple optimal pipeline configurations.
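The Pareto frontier mentioned above can be illustrated with a minimal sketch. This is not MOAR's implementation; the `Candidate` type, the candidate names, and all numbers are made up for illustration. A candidate stays on the frontier only if no other candidate is at least as accurate and at least as cheap while being strictly better on one of the two.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical label for a rewritten pipeline
    accuracy: float  # higher is better
    cost: float      # lower is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    def dominates(a: Candidate, b: Candidate) -> bool:
        # a dominates b if it is no worse on both objectives
        # and strictly better on at least one of them.
        return (a.accuracy >= b.accuracy and a.cost <= b.cost
                and (a.accuracy > b.accuracy or a.cost < b.cost))

    frontier = [c for c in candidates
                if not any(dominates(other, c) for other in candidates)]
    return sorted(frontier, key=lambda c: c.cost)


# Illustrative numbers only; they are not measurements from DocETL.
candidates = [
    Candidate("baseline", accuracy=0.62, cost=1.00),
    Candidate("gleaning", accuracy=0.71, cost=1.80),
    Candidate("cheap_model", accuracy=0.60, cost=0.30),
    Candidate("chunked", accuracy=0.70, cost=2.50),  # dominated by "gleaning"
]
frontier = pareto_frontier(candidates)
```

Here `chunked` drops out because `gleaning` is both more accurate and cheaper; the remaining three each represent a different accuracy/cost trade-off.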
Usage
Execute this workflow when your pipeline produces results that lack depth or accuracy, when documents exceed LLM context windows, when entity resolution needs automatic threshold tuning, or when you want to minimize cost while maintaining quality. The V1 optimizer is suitable for quick improvements (chunking, threshold tuning); MOAR is suitable for comprehensive multi-objective optimization with an evaluation function.
Execution Steps
Step 1: Author Baseline Pipeline
Start with a working YAML pipeline that runs correctly but may not produce optimal results. For V1, mark the operations you want optimized by setting optimize: true on each of them. For MOAR, no per-operation flags are needed; the pipeline only requires the optimizer_config section.
Key considerations:
- The pipeline must run successfully before optimization
- Mark only operations that need improvement with optimize: true (V1)
- For MOAR, prepare a sample or hold-out dataset to avoid optimizing on test data
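One way to prepare a sample and a hold-out set is a deterministic shuffle-and-split over the input documents. The sketch below assumes the dataset is a JSON array of records; the function names, the file paths passed to `write_split`, and the 20% default are illustrative choices, not DocETL requirements.

```python
import json
import random
from pathlib import Path
from typing import List, Tuple


def split_dataset(records: List[dict], sample_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Shuffle deterministically and split into (optimization sample, hold-out)."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = max(1, int(len(shuffled) * sample_fraction))
    return shuffled[:cut], shuffled[cut:]


def write_split(input_path: str, sample_path: str, holdout_path: str) -> None:
    """Read a JSON array of documents and write the two splits to disk."""
    records = json.loads(Path(input_path).read_text())
    sample, holdout = split_dataset(records)
    Path(sample_path).write_text(json.dumps(sample, indent=2))
    Path(holdout_path).write_text(json.dumps(holdout, indent=2))
```

Point the pipeline's dataset at the sample file during optimization, then run the chosen optimized pipeline on the hold-out file to check that the gains carry over.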
Step 2: Write Evaluation Function (MOAR only)
For MOAR optimization, create a Python file containing an evaluation function decorated with @register_eval. The function receives the original dataset file path and the pipeline results file path, in that order, computes evaluation metrics (precision, recall, accuracy scores, etc.), and returns a dictionary of named metrics. The metric_key setting in optimizer_config selects which of the returned metrics the optimizer maximizes.
Key considerations:
- The function must take exactly two arguments: dataset_file_path and results_file_path
- Return a dictionary of numeric metrics
- The metric_key in optimizer_config must match a key in the returned dictionary
- Only one function per file can be decorated with @register_eval
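A minimal sketch of such an evaluation file follows. The import path `docetl.utils_evaluation` is an assumption to verify against your installed DocETL version, and the fallback decorator exists only so the sketch runs without DocETL. The record schema (`id`, `label`, `predicted_label`) is hypothetical; substitute the fields your pipeline actually produces.

```python
import json
from typing import Any, Dict

try:
    # Assumed import path; confirm it against your DocETL installation.
    from docetl.utils_evaluation import register_eval
except ImportError:
    # Stand-in no-op decorator so this sketch runs without DocETL installed.
    def register_eval(func):
        return func


@register_eval
def evaluate_results(dataset_file_path: str, results_file_path: str) -> Dict[str, Any]:
    """Score pipeline output against gold labels stored in the dataset."""
    with open(dataset_file_path) as f:
        dataset = json.load(f)
    with open(results_file_path) as f:
        results = json.load(f)

    # Hypothetical schema: dataset rows carry a gold "label",
    # result rows carry the pipeline's "predicted_label".
    gold = {row["id"]: row["label"] for row in dataset}
    correct = sum(
        1 for row in results
        if row["id"] in gold and gold[row["id"]] == row.get("predicted_label")
    )
    accuracy = correct / len(gold) if gold else 0.0

    # Every value is numeric; "accuracy" is the key metric_key would name.
    return {"accuracy": accuracy, "num_results": len(results)}
```

With this file, setting metric_key to accuracy in optimizer_config would direct the optimizer at the first returned metric; num_results is reported alongside it for inspection.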
Step 3: Configure Optimizer
For V1, simply run the build command; the optimizer reads optimize: true flags from operations. For MOAR, add an optimizer_config section to the YAML specifying the optimizer type, save directory, available models, evaluation file, metric key, maximum iterations, and the rewrite agent model.
Key considerations:
- V1 requires no additional configuration beyond optimize flags
- MOAR's available_models list determines which LLM models the optimizer can try
- max_iterations sets the MCTS search budget: more iterations explore more candidate rewrites and tend to yield better results, but cost more
- The rewrite_agent_model should be a capable model (e.g., gpt-4o or gpt-5.1) for generating rewrite strategies
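The MOAR settings listed above can be sanity-checked before an expensive run. The sketch below mirrors the optimizer_config section as a Python dict. The key names type, save_dir, and evaluation_file are assumptions inferred from the description (only optimizer_config, available_models, metric_key, max_iterations, and rewrite_agent_model are named in this workflow), and the values are placeholders. `check_optimizer_config` is a hypothetical helper, not part of DocETL.

```python
from typing import Any, Dict, List

# Illustrative settings; "type", "save_dir", and "evaluation_file" are
# assumed key names, and every value here is a placeholder.
optimizer_config: Dict[str, Any] = {
    "type": "moar",
    "save_dir": "results/moar_optimization",   # hypothetical path
    "available_models": ["gpt-4o-mini", "gpt-4o"],
    "evaluation_file": "evaluate_results.py",  # hypothetical file name
    "metric_key": "accuracy",
    "max_iterations": 20,
    "rewrite_agent_model": "gpt-4o",
}

REQUIRED_KEYS = [
    "type", "save_dir", "available_models", "evaluation_file",
    "metric_key", "max_iterations", "rewrite_agent_model",
]


def check_optimizer_config(config: Dict[str, Any],
                           returned_metrics: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in config]
    if "available_models" in config and not config["available_models"]:
        problems.append("available_models must list at least one model")
    iterations = config.get("max_iterations")
    if iterations is not None and (not isinstance(iterations, int) or iterations <= 0):
        problems.append("max_iterations must be a positive integer")
    key = config.get("metric_key")
    if key is not None and key not in returned_metrics:
        problems.append(f"metric_key {key!r} is not returned by the evaluation function")
    return problems
```

Passing the dictionary returned by one trial call of the evaluation function as `returned_metrics` catches a mismatched metric_key before any optimization budget is spent.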
Step 4: Run Optimization
Execute the build command via CLI (docetl build pipeline.yaml for V1, or docetl build pipeline.yaml --optimizer moar for MOAR). The optimizer explores candidate configurations, evaluates them, and writes the optimized pipeline to a new YAML file. V1 produces a single optimized pipeline; MOAR produces a Pareto frontier of solutions trading off accuracy and cost.
Key considerations:
- V1 optimization cost depends on the number of candidate plans evaluated (can cost $20+ for complex operations)
- MOAR optimization cost depends on max_iterations and the models used, and can be significantly higher than V1
- Both optimizers support resuming from interrupted runs
- Results are saved to the configured save directory
Step 5: Review and Run Optimized Pipeline
Inspect the generated optimized YAML file(s). The optimizer may have added split/gather operations for chunking, adjusted blocking thresholds, added gleaning rounds, changed models, or restructured operations. Run the optimized pipeline with docetl run and compare output quality and cost against the baseline.
Key considerations:
- MOAR produces multiple pipeline variants on the Pareto frontier; choose based on your accuracy/cost preference
- Review any synthesized prompts (validation prompts, comparison prompts) and edit them if needed
- The experiment_summary.json (MOAR) provides a high-level summary of the optimization run
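Choosing among the Pareto-frontier variants can be as simple as taking the most accurate pipeline that fits a cost budget. The sketch below assumes you have already collected each variant's accuracy and cost into a list yourself; the `Variant` structure, the file names, and the numbers are hypothetical and do not reflect the actual layout of experiment_summary.json.

```python
from typing import List, Optional, TypedDict


class Variant(TypedDict):
    yaml_path: str   # hypothetical path to an optimized pipeline file
    accuracy: float  # value of the optimized metric
    cost: float      # cost of one run, in the same unit as the budget


def pick_variant(frontier: List[Variant], max_cost: float) -> Optional[Variant]:
    """Return the most accurate variant within budget, or None if none fits."""
    affordable = [v for v in frontier if v["cost"] <= max_cost]
    return max(affordable, key=lambda v: v["accuracy"], default=None)


# Illustrative values only.
frontier: List[Variant] = [
    {"yaml_path": "variant_a.yaml", "accuracy": 0.60, "cost": 0.30},
    {"yaml_path": "variant_b.yaml", "accuracy": 0.62, "cost": 1.00},
    {"yaml_path": "variant_c.yaml", "accuracy": 0.71, "cost": 1.80},
]
choice = pick_variant(frontier, max_cost=1.50)
```

With a budget of 1.50 the sketch selects `variant_b.yaml`; raising the budget to 2.00 would select `variant_c.yaml` instead. The selected file is the one to pass to docetl run.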