Implementation:Ucbepic Docetl Pipeline Schema Definition
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Configuration |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
Concrete Pydantic schema classes for defining DocETL pipeline structure including steps, outputs, and operation types.
Description
DocETL uses Pydantic BaseModel subclasses to define the structure of pipeline configurations. The three key classes are:
- PipelineStep: Defines a named step with an ordered list of operations and an optional input source
- PipelineOutput: Specifies output type (file), path, and optional intermediate directory
- PipelineSpec: Combines steps and output into a complete pipeline specification
These schemas are used for both YAML parsing validation and Python API object construction.
Usage
Use these schema classes when programmatically constructing pipeline configurations via the Python API, or when understanding the structure expected by YAML pipeline files.
Code Reference
Source Location
- Repository: docetl
- File: docetl/base_schemas.py
- Lines: L49-131
Signature
class PipelineStep(BaseModel):
    name: str
    operations: list[dict[str, Any] | str]
    input: str | None = None

class PipelineOutput(BaseModel):
    type: str
    path: str
    intermediate_dir: str | None = None

class PipelineSpec(BaseModel):
    steps: list[PipelineStep]
    output: PipelineOutput
Import
from docetl.base_schemas import PipelineStep, PipelineOutput, PipelineSpec
I/O Contract
Inputs
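| Class | Name | Type | Required | Description |
|---|---|---|---|---|
| PipelineStep | name | str | Yes | Step name identifier |
| PipelineStep | operations | list[dict or str] | Yes | Ordered list of operation names or configs |
| PipelineStep | input | str or None | No | Dataset name or previous step name |
| PipelineOutput | type | str | Yes | Output type (e.g., "file") |
| PipelineOutput | path | str | Yes | Output file path |
| PipelineOutput | intermediate_dir | str or None | No | Directory for intermediate results |
| PipelineSpec | steps | list[PipelineStep] | Yes | Steps that make up the pipeline |
| PipelineSpec | output | PipelineOutput | Yes | Output configuration for the pipeline |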
Outputs
| Name | Type | Description |
|---|---|---|
| PipelineStep | BaseModel | Validated pipeline step object |
| PipelineOutput | BaseModel | Validated output configuration |
| PipelineSpec | BaseModel | Complete pipeline specification |
Usage Examples
YAML Pipeline Configuration
pipeline:
  steps:
    - name: process_step
      input: input_dataset
      operations:
        - extract_info
        - summarize
    - name: deduplicate_step
      input: process_step
      operations:
        - resolve_entities
  output:
    type: file
    path: output/results.json
    intermediate_dir: output/intermediates
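Since a step's `input` names either a dataset or a previous step, the chaining in the YAML above can be checked on the parsed configuration. The following is a minimal sketch using only plain dicts; `unresolved_inputs` is a hypothetical helper written for this page, not a DocETL function, and the `steps` list is assumed to be what a YAML loader would produce for the `steps` key.

```python
# The steps from the YAML above, written as the plain Python data a YAML
# loader would be expected to return (assumption).
steps = [
    {
        "name": "process_step",
        "input": "input_dataset",
        "operations": ["extract_info", "summarize"],
    },
    {
        "name": "deduplicate_step",
        "input": "process_step",
        "operations": ["resolve_entities"],
    },
]


def unresolved_inputs(steps, dataset_names):
    """Return (step name, input) pairs whose input is neither a known
    dataset nor the name of an earlier step. Hypothetical helper."""
    known = set(dataset_names)
    problems = []
    for step in steps:
        source = step.get("input")
        if source is not None and source not in known:
            problems.append((step["name"], source))
        # Later steps may refer to this step by name.
        known.add(step["name"])
    return problems


# With the dataset declared, every input resolves.
ok = unresolved_inputs(steps, {"input_dataset"})

# Without it, the first step's input is reported.
missing = unresolved_inputs(steps, set())
```

Here `deduplicate_step` resolves in both calls because it reads from `process_step`, which appears earlier in the list.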
Python API Usage
from docetl.base_schemas import PipelineStep, PipelineOutput

step = PipelineStep(
    name="process_step",
    input="input_data",
    operations=["extract", "summarize"]
)

output = PipelineOutput(
    type="file",
    path="output/results.json",
    intermediate_dir="output/intermediates"
)
Related Pages
Implements Principle