Principle:Ucbepic Docetl Pipeline Execution

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Pipeline_Orchestration
Last Updated 2026-02-08 01:40 GMT

Overview

An orchestration principle for running a complete data processing pipeline defined by declarative configuration: validating its operations, loading its datasets, executing the operations, and saving the results.

Description

Pipeline Execution is the process of taking a validated pipeline configuration and running it end-to-end: loading datasets, executing each operation in dependency order, tracking costs, and saving final results. In DocETL, this is orchestrated by the DSLRunner class, which:

  • Validates all operations via syntax checking before execution
  • Builds a pull-based execution DAG (Directed Acyclic Graph) using OpContainer objects
  • Executes operations lazily: each operation pulls data from its predecessors only when needed
  • Tracks LLM API costs across all operations
  • Supports checkpointing to intermediate files for fault tolerance

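The pull-based DAG model means that evaluation is driven backward from the output: the final operation requests data from its predecessors, which in turn request data from theirs. Only the computations that the output actually depends on are performed.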
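The pull-based model can be illustrated with a minimal sketch. The Node class below is a hypothetical stand-in written for this example; it is not DocETL's OpContainer, and its constructor and next() signature are assumptions. It shows that pulling from the output node runs only the operations the output depends on, and that a shared predecessor is computed once.

```python
class Node:
    """Hypothetical operation container (illustrative only, not DocETL's OpContainer)."""

    def __init__(self, name, fn, parents=()):
        self.name = name
        self.fn = fn
        self.parents = list(parents)
        self._cache = None
        self._done = False

    def next(self, log):
        # Compute on first request only; later requests reuse the cached result.
        if not self._done:
            inputs = [parent.next(log) for parent in self.parents]  # pull from predecessors
            self._cache = self.fn(*inputs)
            self._done = True
            log.append(self.name)  # record execution order
        return self._cache


log = []
source = Node("load", lambda: [1, 2, 3, 4])
doubled = Node("double", lambda rows: [r * 2 for r in rows], [source])
unused = Node("unused", lambda rows: [r + 100 for r in rows], [source])
evens = Node("filter", lambda rows: [r for r in rows if r % 4 == 0], [doubled])

result = evens.next(log)  # pulling from the output node drives everything upstream
print(result)  # [4, 8]
print(log)     # ['load', 'double', 'filter']
```

The "unused" node is never executed because nothing downstream of the output requests it. The requests travel from the output toward the sources, while the computations themselves still complete in dependency order.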

Usage

Use this principle when you need to run a complete DocETL pipeline from configuration to results. This is the core execution flow whether invoked via CLI (docetl run), Python API (Pipeline.run()), or the web playground.

Theoretical Basis

Pipeline execution follows a pull-based lazy evaluation model:

  1. Configuration Parsing: Parse YAML/dict into validated schema objects
  2. DAG Construction: Build operation dependency graph as OpContainer tree
  3. Syntax Validation: Check all operations before execution
  4. Lazy Evaluation: Pull data through DAG starting from output node
  5. Cost Tracking: Aggregate LLM API costs across all operations
  6. Result Persistence: Save final output to configured destination

# Pseudo-code for pipeline execution
def execute_pipeline(config):
    runner = DSLRunner(config)
    runner.syntax_check()            # Validate all operations before running
    runner.load()                    # Load all datasets
    output = runner.last_op.next()   # Pull results through the DAG from the output node
    runner.save(output)              # Persist results
    return runner.total_cost         # Report the aggregated LLM API cost
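
Cost tracking and checkpointing to intermediate files can be sketched in the same spirit. The example below is a simplified illustration under stated assumptions, not DocETL's implementation: run_with_checkpoints, the step functions, and the one-JSON-file-per-step layout are all hypothetical, and for brevity the steps run sequentially rather than through a pull-based DAG. Each step returns its output together with a cost, and a step whose checkpoint file already exists is skipped.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, data, checkpoint_dir):
    """Run steps in order, reusing a step's checkpoint file when it exists (hypothetical helper)."""
    total_cost = 0.0
    executed = []
    for name, fn in steps:
        path = os.path.join(checkpoint_dir, f"{name}.json")
        if os.path.exists(path):
            # Checkpoint hit: load the saved output and skip the step entirely.
            with open(path) as f:
                data = json.load(f)
            continue
        data, cost = fn(data)          # each step returns (output, cost)
        total_cost += cost             # aggregate cost across steps
        executed.append(name)
        with open(path, "w") as f:     # save intermediate output for fault tolerance
            json.dump(data, f)
    return data, total_cost, executed


def tag(rows):
    # Stand-in for an LLM-backed operation with an assumed fixed cost.
    return [{"text": r, "length": len(r)} for r in rows], 0.25


def keep_long(rows):
    return [r for r in rows if r["length"] > 3], 0.5


steps = [("tag", tag), ("keep_long", keep_long)]
checkpoint_dir = tempfile.mkdtemp()

first = run_with_checkpoints(steps, ["alpha", "be", "gamma"], checkpoint_dir)
second = run_with_checkpoints(steps, ["alpha", "be", "gamma"], checkpoint_dir)

print(first[1], first[2])    # 0.75 ['tag', 'keep_long']
print(second[1], second[2])  # 0.0 []
```

The second run incurs no cost because every step is served from its checkpoint. This sketch keys checkpoints by step name alone; a production system would also need to invalidate a checkpoint when a step's configuration or inputs change.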

Related Pages

Implemented By
