Implementation:Ucbepic Docetl BaseSchemas
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Schema_Validation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Foundational Pydantic data models, provided by DocETL, that define the structure of DocETL pipelines.
Description
The base_schemas module defines the core Pydantic models used to represent and validate DocETL pipeline configurations. It includes ToolFunction and Tool for LLM tool definitions, ParsingTool for custom data parsing functions, PipelineStep for individual processing steps with their operations, PipelineOutput for output configuration (type, path, intermediate directory), and PipelineSpec that composes steps and output into a complete pipeline specification. These models are used throughout the codebase to ensure pipeline configurations conform to expected structures.
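As a rough illustration of the tool-related models, the sketch below builds a Tool and a ParsingTool. To stay runnable without DocETL installed, it declares stand-in copies of the models as listed in the Signature section; in real code they would be imported from docetl.base_schemas. The tool name, the function bodies, and the JSON-schema-style parameters dict are hypothetical values chosen for the example, not taken from DocETL.

```python
from typing import Any

from pydantic import BaseModel


# Stand-in copies of the models from the Signature section, declared here
# only so the sketch runs on its own.
class ToolFunction(BaseModel):
    name: str
    description: str
    parameters: dict[str, Any]


class Tool(BaseModel):
    code: str
    function: ToolFunction


class ParsingTool(BaseModel):
    name: str
    function_code: str


# Hypothetical LLM tool: the source code is stored as a string, next to a
# description of the function it exposes.
word_count_tool = Tool(
    code="def count_words(text):\n    return {'count': len(text.split())}",
    function=ToolFunction(
        name="count_words",
        description="Count the words in a text string.",
        parameters={
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    ),
)

# Hypothetical parsing tool: a name plus the Python source of the parser.
# The argument and return shape of the function are assumptions.
upper_parser = ParsingTool(
    name="uppercase_parser",
    function_code="def uppercase_parser(item):\n    return [item]",
)
```

Both models only store the code as strings; validation checks the field types, not whether the code itself is valid Python.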
Usage
Use these schemas when parsing, validating, or constructing DocETL pipeline configurations programmatically. They are the canonical type definitions for pipeline structure elements.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/base_schemas.py
- Lines: 1-130
Signature
class ToolFunction(BaseModel):
    name: str
    description: str
    parameters: dict[str, Any]

class Tool(BaseModel):
    code: str
    function: ToolFunction

class ParsingTool(BaseModel):
    name: str
    function_code: str

class PipelineStep(BaseModel):
    name: str
    operations: list[dict[str, Any] | str]
    input: str | None = None

class PipelineOutput(BaseModel):
    type: str
    path: str
    intermediate_dir: str | None = None

class PipelineSpec(BaseModel):
    steps: list[PipelineStep]
    output: PipelineOutput
Import
from docetl.base_schemas import (
    ToolFunction,
    Tool,
    ParsingTool,
    PipelineStep,
    PipelineOutput,
    PipelineSpec,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Name of the tool, parsing tool, or pipeline step |
| operations | list[dict or str] | Yes | List of operation names or operation config dicts for a step |
| input | str or None | No | Input dataset name or previous step name (None uses previous step output) |
| type | str | Yes | Output type (e.g., "file") |
| path | str | Yes | Output file path |
| intermediate_dir | str or None | No | Directory for intermediate results |
| function_code | str | Yes | Python code defining a parsing function (for ParsingTool) |
Outputs
| Name | Type | Description |
|---|---|---|
| validated_model | BaseModel | A validated Pydantic model instance representing the pipeline element |
Usage Examples
from docetl.base_schemas import PipelineStep, PipelineOutput, PipelineSpec

# Define a pipeline step
step = PipelineStep(
    name="extract_step",
    input="raw_documents",
    operations=["extract_entities", "classify_entities"],
)

# Define pipeline output
output = PipelineOutput(
    type="file",
    path="/output/results.json",
    intermediate_dir="/output/intermediates",
)

# Compose into a full pipeline spec
spec = PipelineSpec(steps=[step], output=output)