Implementation:Ucbepic Docetl Pipeline Run
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Orchestration |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
Concrete Python API class for assembling and running DocETL pipelines programmatically.
Description
The Pipeline class accepts datasets, operations, steps, and output configuration as constructor arguments. It then provides run() for execution, optimize() for automated optimization, and to_yaml() for YAML export. Internally, it converts its configuration to a dict and delegates execution to DSLRunner.
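The convert-to-dict-then-delegate pattern described above can be sketched with standard-library stand-ins. This is a minimal illustration, not DocETL's code: `SketchStep`, `SketchRunner`, `SketchPipeline`, and the exact dict layout are all hypothetical names and assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class SketchStep:
    """Hypothetical stand-in for a pipeline step."""
    name: str
    input: str
    operations: list[str]


class SketchRunner:
    """Hypothetical stand-in for a runner that consumes a plain dict."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    def run(self) -> float:
        # A real runner would execute the operations and call an LLM;
        # this stand-in does no work and reports a cost of zero.
        return 0.0


@dataclass
class SketchPipeline:
    """Hypothetical facade: typed components in, plain dict out."""
    name: str
    datasets: dict[str, dict[str, Any]]
    operations: list[dict[str, Any]]
    steps: list[SketchStep]
    output: dict[str, Any]
    default_model: str | None = None

    def to_dict(self) -> dict[str, Any]:
        # Assumed layout: top-level datasets/operations plus a nested
        # "pipeline" section holding the steps and output settings.
        config: dict[str, Any] = {
            "datasets": self.datasets,
            "operations": self.operations,
            "pipeline": {
                "steps": [asdict(step) for step in self.steps],
                "output": self.output,
            },
        }
        if self.default_model is not None:
            config["default_model"] = self.default_model
        return config

    def run(self) -> float:
        # Delegate execution to the runner, passing only the plain dict.
        return SketchRunner(self.to_dict()).run()
```

Keeping the typed objects at the edge and handing the runner a plain dict means the same runner can, in principle, serve both Python-built and file-based configurations.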
Usage
Use Pipeline when building pipelines in Python code rather than YAML. It is commonly used in Jupyter notebooks, test suites, and applications that generate pipelines dynamically.
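As a rough illustration of dynamic generation, the sketch below builds one map-operation spec per field name. The helpers `build_extract_ops` and `build_step` are hypothetical, and the dict keys simply mirror the keyword arguments used in the usage example on this page; whether such dicts can be passed straight to the schema classes is an assumption to verify against the library.

```python
from __future__ import annotations

from typing import Any


def build_extract_ops(fields: list[str]) -> list[dict[str, Any]]:
    """Build one map-operation spec per field (hypothetical helper)."""
    ops: list[dict[str, Any]] = []
    for field in fields:
        ops.append(
            {
                "name": f"extract_{field}",
                "type": "map",
                # The template placeholder is kept literal, so it is
                # concatenated rather than placed inside the f-string.
                "prompt": f"Extract {field}: " + "{{ input.text }}",
                "output": {"schema": {field: "list[str]"}},
            }
        )
    return ops


def build_step(
    ops: list[dict[str, Any]],
    input_name: str = "input",
    step_name: str = "step1",
) -> dict[str, Any]:
    """Build a step spec that runs the given operations in order."""
    return {
        "name": step_name,
        "input": input_name,
        "operations": [op["name"] for op in ops],
    }


ops = build_extract_ops(["people", "places"])
step = build_step(ops)
```

Under that assumption, each spec could be turned into a schema object (for example with keyword unpacking) before being handed to the Pipeline constructor.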
Code Reference
Source Location
- Repository: docetl
- File: docetl/api.py
- Lines: L84-253
Signature
class Pipeline:
    def __init__(
        self,
        name: str,
        datasets: dict[str, Dataset],
        operations: list[OpType],
        steps: list[PipelineStep],
        output: PipelineOutput,
        parsing_tools: list[ParsingTool | Callable] = [],
        default_model: str | None = None,
        rate_limits: dict[str, int] | None = None,
        optimizer_config: dict[str, Any] = {},
        **kwargs,
    ):
        """Assemble a pipeline from components."""

    def run(self, max_threads: int | None = None) -> float:
        """Execute pipeline. Returns total LLM cost."""

    def to_yaml(self, path: str) -> None:
        """Export pipeline to YAML file."""
Import
from docetl.api import Pipeline
from docetl.schemas import MapOp, ReduceOp, Dataset
from docetl.base_schemas import PipelineStep, PipelineOutput
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Pipeline name |
| datasets | dict[str, Dataset] | Yes | Named dataset objects |
| operations | list[OpType] | Yes | Operation schema objects |
| steps | list[PipelineStep] | Yes | Pipeline execution steps |
| output | PipelineOutput | Yes | Output configuration |
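| parsing_tools | list[ParsingTool or Callable] | No | Parsing tools available to the pipeline (defaults to an empty list) |
| default_model | str or None | No | Fallback LLM model |
| rate_limits | dict[str, int] or None | No | Rate limit settings |
| optimizer_config | dict[str, Any] | No | Optimizer settings (defaults to an empty dict) |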
Outputs
| Name | Type | Description |
|---|---|---|
| run() returns | float | Total LLM API cost |
| output file | JSON or CSV | Results at configured output path |
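Once run() has returned, the results file can be read back with the standard library. The following is a small sketch: `load_results` is a hypothetical helper, not part of DocETL, and it assumes the JSON output is a list of records and that the format can be inferred from the file suffix.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path
from typing import Any


def load_results(path: str) -> list[dict[str, Any]]:
    """Load results from a JSON or CSV file, chosen by file suffix."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".json":
        with file.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: the JSON output is a list of record dicts.
        if not isinstance(data, list):
            raise ValueError(f"expected a list of records in {path}")
        return data
    if suffix == ".csv":
        # csv.DictReader yields one dict per row; all values are strings.
        with file.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    raise ValueError(f"unsupported output format: {suffix!r}")
```

Note that CSV values come back as strings, so list-valued fields such as the `entities` column in the usage example would need their own parsing.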
Usage Examples
from docetl.api import Pipeline
from docetl.schemas import MapOp, ReduceOp, Dataset
from docetl.base_schemas import PipelineStep, PipelineOutput

pipeline = Pipeline(
    name="my_pipeline",
    datasets={"input": Dataset(type="file", path="data.json")},
    operations=[
        MapOp(
            name="extract",
            type="map",
            prompt="Extract entities: {{ input.text }}",
            output={"schema": {"entities": "list[str]"}},
        ),
    ],
    steps=[PipelineStep(name="step1", input="input", operations=["extract"])],
    output=PipelineOutput(type="file", path="output.json"),
    default_model="gpt-4o-mini",
)

cost = pipeline.run()
print(f"Cost: ${cost:.2f}")
Related Pages
Implements Principle
Requires Environment
Uses Heuristic