Principle:Ucbepic Docetl Pipeline Execution

Knowledge Sources	DocETL Docs DocETL
Domains	Data_Engineering, Pipeline_Orchestration
Last Updated	2026-02-08 01:40 GMT

Overview

An orchestration principle that validates, loads, executes, and saves results for a complete data processing pipeline defined by declarative configuration.

Description

Pipeline Execution is the process of taking a validated pipeline configuration and running it end-to-end: loading datasets, executing each operation in dependency order, tracking costs, and saving final results. In DocETL, this is orchestrated by the DSLRunner class which:

Validates all operations via syntax checking before execution
Builds a pull-based execution DAG (Directed Acyclic Graph) using OpContainer objects
Executes operations lazily—each operation pulls data from its predecessors only when needed
Tracks LLM API costs across all operations
Supports checkpointing to intermediate files for fault tolerance

The pull-based DAG model means that operations are evaluated in reverse order from the output, ensuring that only necessary computations are performed.

Usage

Use this principle when you need to run a complete DocETL pipeline from configuration to results. This is the core execution flow whether invoked via CLI (docetl run), Python API (Pipeline.run()), or the web playground.

Theoretical Basis

Pipeline execution follows a pull-based lazy evaluation model:

Configuration Parsing: Parse YAML/dict into validated schema objects
DAG Construction: Build operation dependency graph as OpContainer tree
Syntax Validation: Check all operations before execution
Lazy Evaluation: Pull data through DAG starting from output node
Cost Tracking: Aggregate LLM API costs across all operations
Result Persistence: Save final output to configured destination

# Pseudo-code for pipeline execution
runner = DSLRunner(config)
runner.syntax_check()        # Validate before running
runner.load()                # Load all datasets
output = runner.last_op.next()  # Pull-based execution
runner.save(output)          # Persist results
return runner.total_cost     # Report cost

Related Pages

Implemented By

Implementation:Ucbepic_Docetl_DSLRunner_Load_Run_Save

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment