Principle:Ucbepic Docetl Pipeline Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Orchestration |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
An orchestration principle that validates, loads, executes, and saves results for a complete data processing pipeline defined by declarative configuration.
Description
Pipeline Execution is the process of taking a validated pipeline configuration and running it end-to-end: loading datasets, executing each operation in dependency order, tracking costs, and saving final results. In DocETL, this is orchestrated by the DSLRunner class which:
- Validates all operations via syntax checking before execution
- Builds a pull-based execution DAG (Directed Acyclic Graph) using OpContainer objects
- Executes operations lazily—each operation pulls data from its predecessors only when needed
- Tracks LLM API costs across all operations
- Supports checkpointing to intermediate files for fault tolerance
The pull-based DAG model means that operations are evaluated in reverse order from the output, ensuring that only necessary computations are performed.
Usage
Use this principle when you need to run a complete DocETL pipeline from configuration to results. This is the core execution flow whether invoked via CLI (docetl run), Python API (Pipeline.run()), or the web playground.
Theoretical Basis
Pipeline execution follows a pull-based lazy evaluation model:
- Configuration Parsing: Parse YAML/dict into validated schema objects
- DAG Construction: Build operation dependency graph as OpContainer tree
- Syntax Validation: Check all operations before execution
- Lazy Evaluation: Pull data through DAG starting from output node
- Cost Tracking: Aggregate LLM API costs across all operations
- Result Persistence: Save final output to configured destination
# Pseudo-code for pipeline execution
runner = DSLRunner(config)
runner.syntax_check() # Validate before running
runner.load() # Load all datasets
output = runner.last_op.next() # Pull-based execution
runner.save(output) # Persist results
return runner.total_cost # Report cost