Implementation:Ucbepic Docetl Pipeline Run

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Pipeline_Orchestration
Last Updated 2026-02-08 01:40 GMT

Overview

Concrete Python API class for assembling and running DocETL pipelines programmatically.

Description

The Pipeline class takes a name, datasets, operations, steps, and an output configuration as constructor arguments. It exposes run() for execution, optimize() for automated optimization, and to_yaml() for YAML export. Internally, it converts its configuration to a dict and delegates to DSLRunner.
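The "convert to a dict, then delegate" pattern described above can be pictured with a toy model. This is only a sketch of the pattern, not DocETL code: MiniOp, MiniPipeline, MiniRunner, the dict layout, and the flat per-operation cost are all hypothetical stand-ins for the real schema objects, Pipeline, and DSLRunner.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class MiniOp:
    """Hypothetical stand-in for an operation schema object."""
    name: str
    type: str


class MiniRunner:
    """Hypothetical stand-in for DSLRunner: it only ever sees a plain dict."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    def run(self) -> float:
        # A real runner would call an LLM; here each operation costs a flat 0.01.
        return 0.01 * len(self.config["operations"])


@dataclass
class MiniPipeline:
    """Typed front end that serializes itself and hands off to the runner."""
    name: str
    operations: list[MiniOp] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        # Typed objects are flattened into the dict form the runner consumes.
        return {
            "name": self.name,
            "operations": [asdict(op) for op in self.operations],
        }

    def run(self) -> float:
        # Delegate execution instead of implementing it here.
        return MiniRunner(self.to_dict()).run()


demo = MiniPipeline("demo", [MiniOp("extract", "map"), MiniOp("summarize", "reduce")])
demo_config = demo.to_dict()
demo_cost = demo.run()
```

One consequence of this design is that the typed Python front end and the YAML front end can share a single execution path, since both reduce to the same dict form.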

Usage

Use Pipeline when building pipelines in Python code rather than YAML. It is commonly used in Jupyter notebooks, test suites, and applications that generate pipelines dynamically.
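As a rough illustration of dynamic generation, the sketch below builds one map-operation spec per field name as plain dicts. The helper make_extract_specs and the field names are hypothetical; the keys (name, type, prompt, output) mirror the keyword arguments passed to MapOp in the usage example on this page.

```python
def make_extract_specs(fields: list[str]) -> list[dict]:
    """Build one map-operation spec per field (hypothetical helper)."""
    specs = []
    for field_name in fields:
        specs.append({
            "name": f"extract_{field_name}",
            "type": "map",
            # The Jinja placeholder is a plain string so its braces stay literal.
            "prompt": f"Extract {field_name}: " + "{{ input.text }}",
            "output": {"schema": {field_name: "list[str]"}},
        })
    return specs


specs = make_extract_specs(["people", "places"])
# Operation names in order, ready to be referenced from a pipeline step.
step_operations = [spec["name"] for spec in specs]
```

Assuming MapOp accepts these keys as keyword arguments, each spec could then be expanded with MapOp(**spec), and step_operations passed as the operations list of a PipelineStep.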

Code Reference

Source Location

  • Repository: docetl
  • File: docetl/api.py
  • Lines: L84-253

Signature

class Pipeline:
    def __init__(
        self,
        name: str,
        datasets: dict[str, Dataset],
        operations: list[OpType],
        steps: list[PipelineStep],
        output: PipelineOutput,
        parsing_tools: list[ParsingTool | Callable] = [],
        default_model: str | None = None,
        rate_limits: dict[str, int] | None = None,
        optimizer_config: dict[str, Any] = {},
        **kwargs,
    ):
        """Assemble a pipeline from components."""

    def run(self, max_threads: int | None = None) -> float:
        """Execute pipeline. Returns total LLM cost."""

    def to_yaml(self, path: str) -> None:
        """Export pipeline to YAML file."""

Import

from docetl.api import Pipeline
from docetl.schemas import MapOp, ReduceOp, Dataset
from docetl.base_schemas import PipelineStep, PipelineOutput

I/O Contract

Inputs

Name           Type                Required  Description
name           str                 Yes       Pipeline name
datasets       dict[str, Dataset]  Yes       Named dataset objects
operations     list[OpType]        Yes       Operation schema objects
steps          list[PipelineStep]  Yes       Pipeline execution steps
output         PipelineOutput      Yes       Output configuration
default_model  str or None         No        Fallback LLM model

Outputs

Name           Type         Description
run() returns  float        Total LLM API cost
output file    JSON or CSV  Results at configured output path
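Once a run finishes, the results can be read back from the configured output path. The sketch below uses only the standard library and assumes, without confirmation from this page, that a JSON output file holds a list of record objects. The helper load_results and the sample record are hypothetical, and a temporary file stands in for a real pipeline output.

```python
import json
import tempfile
from pathlib import Path


def load_results(path) -> list:
    """Load a JSON results file and check that it is a list of records."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"Expected a list of records in {path}")
    return data


# Demo: write a stand-in output file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "output.json"
    sample = [{"text": "Ada met Bob", "entities": ["Ada", "Bob"]}]
    out_path.write_text(json.dumps(sample), encoding="utf-8")
    results = load_results(out_path)
```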

Usage Examples

from docetl.api import Pipeline
from docetl.schemas import MapOp, ReduceOp, Dataset
from docetl.base_schemas import PipelineStep, PipelineOutput

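# Assemble a pipeline with one file-backed dataset, one map operation, and one step.
pipeline = Pipeline(
    name="my_pipeline",
    datasets={"input": Dataset(type="file", path="data.json")},
    operations=[
        # The map operation fills the prompt template and returns a list of entities.
        MapOp(name="extract", type="map",
              prompt="Extract entities: {{ input.text }}",
              output={"schema": {"entities": "list[str]"}}),
    ],
    # The step reads the "input" dataset and applies the "extract" operation.
    steps=[PipelineStep(name="step1", input="input", operations=["extract"])],
    output=PipelineOutput(type="file", path="output.json"),
    default_model="gpt-4o-mini",
)

# run() executes the pipeline and returns the total LLM cost.
cost = pipeline.run()
print(f"Cost: ${cost:.2f}")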
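Because run() returns the total cost as a float, a caller can wrap it in a simple budget check. The sketch below relies only on the run(max_threads) signature listed above. The helper run_and_check_budget and the FakePipeline stand-in are hypothetical and exist so the example runs without DocETL or an LLM key.

```python
from typing import Optional, Protocol


class Runnable(Protocol):
    """Anything exposing the run() signature documented on this page."""

    def run(self, max_threads: Optional[int] = None) -> float: ...


def run_and_check_budget(
    pipeline: Runnable, budget: float, max_threads: Optional[int] = None
) -> float:
    """Run the pipeline, then raise if the reported cost exceeds the budget."""
    cost = pipeline.run(max_threads=max_threads)
    if cost > budget:
        raise RuntimeError(f"Pipeline cost ${cost:.2f} exceeded budget ${budget:.2f}")
    return cost


class FakePipeline:
    """Demo stand-in: records max_threads and returns a fixed cost."""

    def __init__(self, cost: float):
        self.cost = cost
        self.seen_max_threads = None

    def run(self, max_threads: Optional[int] = None) -> float:
        self.seen_max_threads = max_threads
        return self.cost


fake = FakePipeline(cost=0.12)
reported = run_and_check_budget(fake, budget=1.00, max_threads=4)
```

Note that the check happens after the run completes, so it reports an overrun rather than preventing the spend.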
