
Workflow:Ucbepic Docetl Python API Pipeline

From Leeroopedia


Knowledge Sources
Domains LLM_Ops, Data_Engineering, Python_API
Last Updated 2026-02-08 03:00 GMT

Overview

End-to-end process for programmatically defining, optimizing, and running DocETL data processing pipelines using the Python API instead of YAML configuration files.

Description

This workflow covers the Python-first approach to building DocETL pipelines. Instead of writing YAML, users construct pipeline objects in Python using typed classes: Dataset, MapOp, ReduceOp, FilterOp, ResolveOp, UnnestOp, CodeMapOp, PipelineStep, PipelineOutput, and Pipeline. The Python API provides the same capabilities as YAML (all operation types, optimization, caching) while additionally enabling programmatic pipeline construction, dynamic prompt generation, integration with existing Python codebases, and code operations (CodeMapOp, CodeReduceOp, CodeFilterOp) for deterministic preprocessing steps. The Pipeline object exposes run() to execute the pipeline and optimize() to produce an optimized variant.

Usage

Execute this workflow when you need to build DocETL pipelines within a larger Python application, when you want to programmatically generate prompts or schemas, when you need code operations for deterministic transformations alongside LLM operations, or when you prefer type-checked Python objects over YAML. This approach also supports the pandas SemanticAccessor (df.semantic) for lightweight LLM operations on DataFrames.

Execution Steps

Step 1: Install DocETL and Configure Environment

Install the docetl package via pip. Set up the LLM API key in a .env file or as an environment variable. Import the necessary classes from docetl.api: Pipeline, Dataset, MapOp, ReduceOp, UnnestOp, ResolveOp, FilterOp, CodeMapOp, PipelineStep, and PipelineOutput.

Key considerations:

  • Python 3.10 or later is required
  • The OPENAI_API_KEY environment variable must be set (or the appropriate key for your LLM provider)
  • The docetl.api module provides all the typed operation classes
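A minimal setup sketch, assuming the key is supplied via a .env file (python-dotenv is an optional convenience here, not a DocETL requirement):

```python
# pip install docetl
import os

from dotenv import load_dotenv  # optional helper for reading a .env file

# Pull OPENAI_API_KEY (or the appropriate key for your provider)
# into the environment before any pipeline code runs.
load_dotenv()
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running a pipeline")

# All typed classes used in this workflow live in docetl.api.
from docetl.api import (
    Pipeline, Dataset,
    MapOp, ReduceOp, FilterOp, ResolveOp, UnnestOp, CodeMapOp,
    PipelineStep, PipelineOutput,
)
```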

Step 2: Define Dataset and Operations

Create a Dataset object pointing to the JSON data file. Instantiate operation objects (MapOp, ReduceOp, etc.) with their parameters: name, type, prompt (Jinja2 template), output schema, and any operation-specific settings. For deterministic preprocessing, use CodeMapOp with a Python function instead of an LLM prompt.

Key considerations:

  • Each operation must have a unique name
  • Prompt strings use the same Jinja2 template syntax as YAML pipelines
  • CodeMapOp accepts a Python function that takes a dict and returns a dict
  • Output schemas use string type notation matching the YAML format
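A sketch of Step 2, assuming an input file at data/transcripts.json whose documents have a text field; the prompts, schemas, and operation names are illustrative, and the exact shape of the CodeMapOp code parameter (here, a string defining a transform function) should be checked against the code-operations docs for your installed DocETL version:

```python
from docetl.api import Dataset, MapOp, CodeMapOp

# Dataset pointing at a local JSON file (path is illustrative).
transcripts = Dataset(type="file", path="data/transcripts.json")

# Deterministic preprocessing: a code operation runs Python instead of an LLM.
# The function takes a document dict and returns a dict of updated fields.
clean_text = CodeMapOp(
    name="clean_text",
    type="code_map",
    code="""
def transform(doc):
    return {"text": doc["text"].strip()}
""",
)

# LLM operation: a Jinja2 prompt plus a string-typed output schema,
# using the same notation as YAML pipelines.
extract_themes = MapOp(
    name="extract_themes",
    type="map",
    prompt="What are the main themes in this transcript?\n{{ input.text }}",
    output={"schema": {"themes": "list[str]"}},
)
```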

Step 3: Assemble and Run Pipeline

Create PipelineStep objects listing operations in execution order. Create a PipelineOutput specifying the output file path and optional intermediate directory. Instantiate the Pipeline with all components (datasets, operations, steps, output, default_model, optional system_prompt). Call pipeline.run() to execute, which returns the total cost.

Key considerations:

  • Operations are referenced by name string in PipelineStep, not by object
  • The system_prompt dict can include dataset_description and persona fields
  • pipeline.run() returns the total LLM API cost as a float
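Continuing the sketch, the dataset and operations from Step 2 can be assembled and run as follows (the pipeline name, file paths, and model choice are illustrative assumptions):

```python
from docetl.api import Pipeline, PipelineStep, PipelineOutput

pipeline = Pipeline(
    name="transcript_themes",
    datasets={"transcripts": transcripts},    # Dataset from Step 2
    operations=[clean_text, extract_themes],  # operation objects from Step 2
    steps=[
        # Operations are referenced by name string, in execution order.
        PipelineStep(
            name="theme_step",
            input="transcripts",
            operations=["clean_text", "extract_themes"],
        ),
    ],
    output=PipelineOutput(
        type="file",
        path="output/themes.json",
        intermediate_dir="output/intermediate",  # optional
    ),
    default_model="gpt-4o-mini",
    system_prompt={  # optional
        "dataset_description": "customer call transcripts",
        "persona": "a customer-experience analyst",
    },
)

cost = pipeline.run()  # returns the total LLM API cost as a float
print(f"Pipeline finished; total cost ${cost:.2f}")
```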

Step 4: Optimize Pipeline (Optional)

Call pipeline.optimize() to invoke the V1 optimizer, which returns an optimized Pipeline object. The optimizer analyzes operations marked for optimization and generates improved configurations (chunking, blocking thresholds, gleaning). Run the optimized pipeline with optimized_pipeline.run().

Key considerations:

  • Optimization adds cost for evaluating candidate plans
  • The returned optimized pipeline can be inspected to see what changes were made
  • For MOAR optimization, use the CLI approach with YAML export
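The optimization step is then a short sketch (assuming the pipeline object from Step 3, with the operations to be rewritten flagged for optimization when they were defined):

```python
# Invoke the V1 optimizer; returns a new, optimized Pipeline object.
optimized_pipeline = pipeline.optimize()

# Inspect the rewritten operations to see what changed
# (e.g., chunking, blocking thresholds, or gleaning added).
for op in optimized_pipeline.operations:
    print(op)

# Execute the optimized version.
cost = optimized_pipeline.run()
```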

Step 5: Use Pandas Integration (Alternative)

For lightweight operations, use the pandas SemanticAccessor (df.semantic) directly on DataFrames. This provides map, filter, merge, agg, split, gather, and unnest operations as DataFrame methods, avoiding the need to construct a full Pipeline object.

Key considerations:

  • Import SemanticAccessor from docetl to register the .semantic accessor
  • Each semantic operation returns a new DataFrame
  • Suitable for exploratory analysis and simple transformations
  • Does not support the full pipeline optimization workflow
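A sketch of the pandas route with an illustrative DataFrame; the keyword names accepted by df.semantic.map are shown here mirroring the pipeline API's prompt/output shape and should be verified against the SemanticAccessor docs:

```python
import pandas as pd
from docetl import SemanticAccessor  # noqa: F401 -- import registers df.semantic

df = pd.DataFrame(
    {"review": ["Great product, fast shipping.", "Broke after two days."]}
)

# Each semantic operation returns a new DataFrame with the schema
# fields added as columns; the original df is left untouched.
labeled = df.semantic.map(
    prompt="Classify the sentiment of this review: {{ input.review }}",
    output={"schema": {"sentiment": "string"}},
)
print(labeled[["review", "sentiment"]])
```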

GitHub URL

Workflow Repository