Workflow:Ucbepic Docetl YAML Pipeline Execution

Knowledge Sources	DocETL, DocETL Documentation, Mining Product Reviews Tutorial, DocETL Paper
Domains	LLM_Ops, Data_Engineering, ETL
Last Updated	2026-02-08 03:00 GMT

Overview

End-to-end process for defining and executing LLM-powered data processing pipelines using DocETL's YAML configuration language.

Description

This workflow covers the primary use case of DocETL: authoring a declarative YAML pipeline that loads a dataset, applies a sequence of operations, most of them LLM-powered (map, unnest, resolve, reduce, filter, and others), and writes structured output. The pipeline engine (DSLRunner) parses the YAML, validates schemas, builds a pull-based execution DAG, and runs operations sequentially within each step. Intermediate results are checkpointed to disk so that an interrupted run can be resumed. All LLM calls go through a unified API abstraction (LiteLLM) that supports multiple providers and adds retry logic, token counting, caching, and rate limiting.
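The caching and retry behaviour described above can be illustrated with a small sketch. This is not DocETL's or LiteLLM's code: `cached_call`, `flaky_llm`, and the in-memory `cache` dict are hypothetical stand-ins, and a real setup would persist the cache to disk and call an actual provider.

```python
import hashlib
import json
import time


def cached_call(fn, model, prompt, cache, retries=3, backoff=0.0):
    """Call fn(model, prompt) with retries, memoizing results by content hash."""
    # The cache key depends only on the model name and the prompt text.
    key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key]
    for attempt in range(1, retries + 1):
        try:
            result = fn(model, prompt)
        except RuntimeError:
            if attempt == retries:
                raise
            # Exponential backoff between attempts (zero by default here).
            time.sleep(backoff * 2 ** (attempt - 1))
        else:
            cache[key] = result
            return result
    raise ValueError("retries must be at least 1")


calls = {"count": 0}


def flaky_llm(model, prompt):
    """Hypothetical stand-in for a provider call: fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated transient provider error")
    return f"{model} says: {prompt.upper()}"


cache = {}
first = cached_call(flaky_llm, "demo-model", "hello", cache)
second = cached_call(flaky_llm, "demo-model", "hello", cache)  # served from cache
```

The second call never reaches `flaky_llm`, which is the property that lets a re-run avoid paying for identical requests.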

Usage

Execute this workflow when you have a dataset of unstructured or semi-structured documents (JSON) and need to extract, transform, resolve, or aggregate information using LLM prompts. Typical scenarios include extracting entities from medical transcripts, analyzing product reviews, processing debate transcripts, or classifying documents. The input is a JSON file and a YAML pipeline configuration; the output is a structured JSON file with the processed results.

Execution Steps

Step 1: Prepare Dataset

Organize source documents into a JSON file as a list of objects, where each object represents one document with its fields (e.g., a "src" key containing text content). Place this file in the project directory alongside a .env file with the required LLM API key (e.g., OPENAI_API_KEY).

Key considerations:

Each document should have a consistent schema across all items
Optional parsing tools can be used to convert PDFs, audio, or other formats to text at load time
Large datasets can be sampled during development using the sample parameter on operations
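As a minimal sketch of this step, the snippet below writes a small dataset as a JSON list of objects and checks that every document has the same keys. The helper `check_consistent_schema`, the sample documents, and the file name `dataset.json` are illustrative assumptions; only the "src" key comes from the description above.

```python
import json
import tempfile
from pathlib import Path


def check_consistent_schema(docs):
    """Return the shared key set, or raise if any document deviates from the first."""
    if not docs:
        raise ValueError("dataset is empty")
    expected = set(docs[0])
    for i, doc in enumerate(docs):
        if set(doc) != expected:
            raise ValueError(
                f"document {i} has keys {sorted(doc)}, expected {sorted(expected)}"
            )
    return expected


# Illustrative documents; each object is one document with a "src" text field.
docs = [
    {"id": 1, "src": "Patient reports mild headache after starting medication."},
    {"id": 2, "src": "Follow-up visit: symptoms resolved, dosage unchanged."},
]
keys = check_consistent_schema(docs)

# A temporary directory stands in for the project directory here.
project_dir = Path(tempfile.mkdtemp())
dataset_path = project_dir / "dataset.json"
dataset_path.write_text(json.dumps(docs, indent=2), encoding="utf-8")
```

Running a check like this before the pipeline catches missing fields early, before any prompt template references them.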

Step 2: Define Pipeline Configuration

Create a YAML file that declares datasets, a default LLM model, an optional system prompt, the operations to apply, and the pipeline steps connecting them. Each operation specifies its type (map, reduce, filter, resolve, unnest, equijoin, split, gather, cluster, sample, rank, topk, extract, scan, link_resolve). LLM-powered operations such as map, reduce, and filter also specify a Jinja2 prompt template and an output schema defining the expected JSON structure; structural operations such as unnest need neither.

Key considerations:

Operations are defined once and referenced by name in pipeline steps
Jinja2 templates allow dynamic injection of document fields into prompts
Output schemas use Python type syntax (str, list[str], list[dict]) for structured LLM output extraction
System prompts can provide dataset-level context and persona guidance to all operations
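The shape of such a configuration can be sketched as a Python dictionary that mirrors the YAML layout. The key names below follow the DocETL documentation as best understood here and should be checked against the current docs; the dataset name, operation names, prompt text, and model name are illustrative assumptions. The helper `undefined_operations` is hypothetical and only demonstrates the "defined once, referenced by name" rule.

```python
# Dictionary mirroring the YAML layout (assumed key names; verify against the docs).
config = {
    "datasets": {"reviews": {"type": "file", "path": "dataset.json"}},
    "default_model": "gpt-4o-mini",  # illustrative model name
    "system_prompt": {
        "dataset_description": "a collection of product reviews",
        "persona": "a product analyst summarizing customer feedback",
    },
    "operations": [
        {
            "name": "extract_themes",
            "type": "map",
            # Jinja2 template: the document's "src" field is injected at run time.
            "prompt": "List the main themes in this review: {{ input.src }}",
            "output": {"schema": {"themes": "list[str]"}},
        },
        # Structural operation: no prompt or output schema.
        {"name": "unnest_themes", "type": "unnest", "unnest_key": "themes"},
    ],
    "pipeline": {
        "steps": [
            {
                "name": "analyze",
                "input": "reviews",
                "operations": ["extract_themes", "unnest_themes"],
            }
        ],
        "output": {
            "type": "file",
            "path": "output.json",
            "intermediate_dir": "intermediate",
        },
    },
}


def undefined_operations(cfg):
    """Return (step, operation) pairs that reference an operation never defined."""
    defined = {op["name"] for op in cfg["operations"]}
    return [
        (step["name"], name)
        for step in cfg["pipeline"]["steps"]
        for name in step["operations"]
        if name not in defined
    ]


missing = undefined_operations(config)
```

An empty `missing` list means every step refers only to operations declared in the `operations` section.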

Step 3: Validate and Execute Pipeline

Run the pipeline using the CLI command (docetl run pipeline.yaml). The runner performs syntax validation on all operations, then executes each step sequentially. Within a step, operations run in order: each operation processes the full dataset output of the previous operation. The runner tracks costs, logs progress, and saves intermediate results to the configured intermediate directory.

Key considerations:

Intermediate results enable resuming from the last completed operation on re-run
LLM responses are cached to disk to avoid redundant API calls
Cost tracking is displayed per-operation and per-step at completion
Rate limiting prevents exceeding provider API quotas
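The sequential, checkpointed execution described in this step can be modelled with a toy runner. This is a sketch of the idea only, not DSLRunner's implementation: `run_step`, the two stand-in operations, and the checkpoint file layout are assumptions, and the sketch omits details such as invalidating checkpoints when an operation's configuration changes.

```python
import json
import tempfile
from pathlib import Path


def run_step(step_name, data, operations, intermediate_dir):
    """Apply (name, function) pairs in order, checkpointing each result as JSON."""
    step_dir = Path(intermediate_dir) / step_name
    step_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for op_name, op_fn in operations:
        checkpoint = step_dir / f"{op_name}.json"
        if checkpoint.exists():
            # Resume: reuse the saved output instead of re-running the operation.
            data = json.loads(checkpoint.read_text(encoding="utf-8"))
            continue
        # Each operation receives the full output of the previous one.
        data = op_fn(data)
        checkpoint.write_text(json.dumps(data), encoding="utf-8")
        executed.append(op_name)
    return data, executed


def add_length(docs):
    """Map-like stand-in: add a field to every document."""
    return [{**d, "length": len(d["src"])} for d in docs]


def keep_long(docs):
    """Filter-like stand-in: keep documents longer than five characters."""
    return [d for d in docs if d["length"] > 5]


docs = [{"src": "short"}, {"src": "a longer document"}]
ops = [("add_length", add_length), ("keep_long", keep_long)]
workdir = tempfile.mkdtemp()

first, ran_first = run_step("analyze", docs, ops, workdir)
second, ran_second = run_step("analyze", docs, ops, workdir)  # resumes from disk
```

On the second call nothing is executed, because both operations already have checkpoints in the intermediate directory. The actual run is started from the shell with the command named above, docetl run pipeline.yaml.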

Step 4: Review Output

Inspect the output JSON file and intermediate results. Each document in the output contains the original fields plus any fields added by the operations. Evaluate whether the LLM extractions meet quality expectations. If results are unsatisfactory, iterate on prompts, adjust output schemas, add validation rules, or enable gleaning (multi-round validation) on operations.

Key considerations:

Intermediate files show the output after each operation for debugging
The sample parameter allows testing on a subset before running on the full dataset
Prompts can be refined iteratively without re-processing operations whose results are already cached
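A quick way to start the review is to measure how often each field is actually populated in the output. The sketch below assumes the output is a JSON list of objects, as described above; the helper `field_coverage` and the sample records are hypothetical.

```python
import json


def field_coverage(records):
    """Fraction of records in which each field is present and non-empty."""
    counts = {}
    for rec in records:
        for key, value in rec.items():
            if value not in (None, "", [], {}):
                counts[key] = counts.get(key, 0) + 1
    total = len(records)
    return {key: n / total for key, n in counts.items()} if total else {}


# Stand-in for the contents of the output file; in practice, read the file
# with json.load instead of parsing a literal string.
output_text = """
[
  {"src": "Great battery life.", "themes": ["battery"]},
  {"src": "Arrived late.", "themes": []}
]
"""
records = json.loads(output_text)
coverage = field_coverage(records)
```

A field with low coverage, such as `themes` in this sample, is a reasonable place to begin when deciding which prompt or schema to revise.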

GitHub URL

Workflow Repository