Workflow:Ucbepic DocETL Python API Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Data_Engineering, Python_API |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
End-to-end process for programmatically defining, optimizing, and running DocETL data processing pipelines using the Python API instead of YAML configuration files.
Description
This workflow covers the Python-first approach to building DocETL pipelines. Instead of writing YAML, users construct pipeline objects in Python using typed classes: Dataset, MapOp, ReduceOp, FilterOp, ResolveOp, UnnestOp, CodeMapOp, PipelineStep, PipelineOutput, and Pipeline. The Python API provides the same capabilities as YAML (all operation types, optimization, caching) but enables programmatic pipeline construction, dynamic prompt generation, integration with existing Python codebases, and the use of code operations (CodeMapOp, CodeReduceOp, CodeFilterOp) for deterministic preprocessing steps. The Pipeline object exposes run() and optimize() methods for execution and optimization respectively.
Usage
Execute this workflow when you need to build DocETL pipelines within a larger Python application, when you want to programmatically generate prompts or schemas, when you need code operations for deterministic transformations alongside LLM operations, or when you prefer type-checked Python objects over YAML. This approach also supports the pandas SemanticAccessor (df.semantic) for lightweight LLM operations on DataFrames.
Execution Steps
Step 1: Install DocETL and Configure Environment
Install the docetl package via pip. Set up the LLM API key in a .env file or as an environment variable. Import the necessary classes from docetl.api: Pipeline, Dataset, MapOp, ReduceOp, UnnestOp, ResolveOp, FilterOp, CodeMapOp, PipelineStep, and PipelineOutput.
Key considerations:
- Python 3.10 or later is required
- The OPENAI_API_KEY environment variable must be set (or the appropriate key for your LLM provider)
- The docetl.api module provides all the typed operation classes
Step 2: Define Dataset and Operations
Create a Dataset object pointing to the JSON data file. Instantiate operation objects (MapOp, ReduceOp, etc.) with their parameters: name, type, prompt (Jinja2 template), output schema, and any operation-specific settings. For deterministic preprocessing, use CodeMapOp with a Python function instead of an LLM prompt.
Key considerations:
- Each operation must have a unique name
- Prompt strings use the same Jinja2 template syntax as YAML pipelines
- CodeMapOp accepts Python source code (a string) defining a function that takes a document dict and returns a dict
- Output schemas use string type notation matching the YAML format
Step 3: Assemble and Run Pipeline
Create PipelineStep objects listing operations in execution order. Create a PipelineOutput specifying the output file path and optional intermediate directory. Instantiate the Pipeline with all components (datasets, operations, steps, output, default_model, optional system_prompt). Call pipeline.run() to execute, which returns the total cost.
Key considerations:
- Operations are referenced by name string in PipelineStep, not by object
- The system_prompt dict can include dataset_description and persona fields
- pipeline.run() returns the total LLM API cost as a float
Step 4: Optimize Pipeline (Optional)
Call pipeline.optimize() to invoke the V1 optimizer, which returns an optimized Pipeline object. The optimizer analyzes operations marked for optimization and generates improved configurations (chunking, blocking thresholds, gleaning). Run the optimized pipeline with optimized_pipeline.run().
Key considerations:
- Optimization adds cost for evaluating candidate plans
- The returned optimized pipeline can be inspected to see what changes were made
- For MOAR optimization, use the CLI approach with YAML export
Step 5: Use Pandas Integration (Alternative)
For lightweight operations, use the pandas SemanticAccessor (df.semantic) directly on DataFrames. This provides map, filter, merge, agg, split, gather, and unnest operations as DataFrame methods, avoiding the need to construct a full Pipeline object.
Key considerations:
- Import SemanticAccessor from docetl to register the .semantic accessor
- Each semantic operation returns a new DataFrame
- Suitable for exploratory analysis and simple transformations
- Does not support the full pipeline optimization workflow