Implementation:Ucbepic Docetl InstantiateSchemas
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Optimization, Schema_Validation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A set of Pydantic schema classes, provided by DocETL, that validate and structure the LLM-generated instantiations of optimizer directives.
Description
The instantiate_schemas module defines a comprehensive set of Pydantic BaseModel classes used to validate the structured outputs produced by an LLM when instantiating optimizer directives. Each schema class corresponds to a specific rewrite directive (e.g., chaining, gleaning, model changes, document summarization, subtask isolation, document compression, operator fusion, chunking). The schemas enforce that LLM-generated prompts contain correct Jinja2 template references, output keys match expectations, and code blocks define required functions. Helper functions extract_for_variable and create_dynamic_pattern support template variable extraction for reduce operations.
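The template checks described above can be pictured with a small standard-library sketch. The helper names (`find_input_refs`, `find_for_variable`) and both regular expressions are assumptions made for illustration. They are not the module's validators, nor the implementation of `extract_for_variable`. The sketch also assumes that reduce prompts loop over a variable named `inputs`.

```python
from __future__ import annotations

import re

# Hypothetical patterns: the real module may match templates differently.
INPUT_REF = re.compile(r"\{\{\s*input\.([A-Za-z_][A-Za-z0-9_]*)[^}]*\}\}")
FOR_LOOP = re.compile(
    r"\{%-?\s*for\s+([A-Za-z_][A-Za-z0-9_]*)\s+in\s+inputs\s*-?%\}"
)


def find_input_refs(prompt: str) -> set[str]:
    """Return the keys referenced as {{ input.<key> }} in a map prompt."""
    return set(INPUT_REF.findall(prompt))


def find_for_variable(template: str) -> str | None:
    """Return the loop variable of the first '{% for <var> in inputs %}' tag."""
    match = FOR_LOOP.search(template)
    return match.group(1) if match else None
```

A validator built on helpers like these could reject a map prompt whose set of referenced keys is empty, or a reduce prompt for which no loop variable is found.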
Usage
Use these schemas when the reasoning optimizer or MOAR search generates new pipeline configurations from directives. They serve as structured output formats for LLM calls and validation layers before applying rewrites to pipeline YAML.
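One way to use a schema as a validation layer is to parse the LLM reply, validate it, and feed any error back into the next attempt. The sketch below is a generic pattern, not DocETL's own control flow. `instantiate_with_retries`, `call_llm`, and `validate` are hypothetical names, and the retry policy is an assumption.

```python
import json
from typing import Callable, TypeVar

T = TypeVar("T")


def instantiate_with_retries(
    call_llm: Callable[[str], str],   # hypothetical: sends a message, returns raw text
    validate: Callable[[dict], T],    # hypothetical: raises ValueError on bad input
    request: str,
    max_attempts: int = 3,
) -> T:
    """Ask for JSON, validate it, and report validation errors back on failure."""
    message = request
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(message)
        try:
            # json.JSONDecodeError is a ValueError subclass, so one handler
            # covers both malformed JSON and failed validation.
            return validate(json.loads(raw))
        except ValueError as exc:
            last_error = exc
            message = f"{request}\n\nPrevious attempt failed validation: {exc}"
    raise ValueError(
        f"No valid instantiation after {max_attempts} attempts"
    ) from last_error
```

With DocETL installed, `validate` might be as simple as `lambda data: ChangeModelInstantiateSchema(**data)`, since this page lists `ValueError` as the failure signal for these schemas.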
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/reasoning_optimizer/instantiate_schemas.py
- Lines: 1-1335
Signature
def extract_for_variable(template_content) -> str | None: ...
def create_dynamic_pattern(new_key, template_content) -> tuple[str, str | None]: ...

class MapOpConfig(BaseModel):
    name: str
    prompt: str
    output_keys: List[str]

class ChainingInstantiateSchema(BaseModel):
    new_ops: List[MapOpConfig]

class GleaningInstantiateSchema(BaseModel):
    validation_prompt: str
    num_rounds: int
    model: str = "gpt-4o-mini"

class ChangeModelInstantiateSchema(BaseModel):
    model: str = "gpt-4o-mini"

class DocSummarizationInstantiateSchema(BaseModel):
    name: str
    document_key: str
    prompt: str
    model: str = "gpt-4o-mini"

class SubtaskConfig(BaseModel):
    name: str
    prompt: str
    output_keys: List[str]

class IsolatingSubtasksInstantiateSchema(BaseModel):
    subtasks: List[SubtaskConfig]
    aggregation_prompt: str = ""

class DocCompressionInstantiateSchema(BaseModel):
    name: str
    document_key: str
    prompt: str
    model: str = "gpt-4o-mini"

class DeterministicDocCompressionInstantiateSchema(BaseModel):
    name: str
    code: str

class OperatorFusionInstantiateSchema(BaseModel):
    fused_prompt: str

class ChunkSubsectionConfig(BaseModel):
    count: int
Import
from docetl.reasoning_optimizer.instantiate_schemas import (
    MapOpConfig,
    ChainingInstantiateSchema,
    GleaningInstantiateSchema,
    ChangeModelInstantiateSchema,
    DocSummarizationInstantiateSchema,
    IsolatingSubtasksInstantiateSchema,
    DocCompressionInstantiateSchema,
    DeterministicDocCompressionInstantiateSchema,
    OperatorFusionInstantiateSchema,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Name of the operator or directive being instantiated |
| prompt | str | Yes | Jinja2 template prompt (must contain {{ input.key }} references for map operations) |
| output_keys | List[str] | Yes | Keys produced by the operation, referenced downstream |
| validation_prompt | str | Yes | Prompt for gleaning validation (must NOT contain Jinja variables) |
| model | str | No | LLM model name (default: "gpt-4o-mini") |
| code | str | Yes | Python code with a transform function (for deterministic compression) |
| document_key | str | Yes | Input key containing long content to summarize or compress |
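The requirement that `code` contain a transform function can be checked without executing the code. The sketch below uses the standard `ast` module. `defines_function` is a hypothetical helper, and the rule that the function must be defined at the top level is an assumption rather than something the source states.

```python
import ast


def defines_function(code: str, func_name: str = "transform") -> bool:
    """Return True if `code` parses and defines a top-level `func_name`."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Code that does not parse cannot define the required function.
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == func_name
        for node in tree.body
    )
```

Parsing instead of calling `exec` keeps the check free of side effects, which matters when the code string comes from an LLM.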
Outputs
| Name | Type | Description |
|---|---|---|
| validated_schema | BaseModel | A validated Pydantic model instance ready for directive instantiation |
| validation_errors | ValueError | Raised when prompts, keys, or code do not meet constraints |
Usage Examples
from docetl.reasoning_optimizer.instantiate_schemas import (
    ChainingInstantiateSchema,
    MapOpConfig,
)

# Validate a chaining directive with two map operations
chain = ChainingInstantiateSchema(
    new_ops=[
        MapOpConfig(
            name="extract_entities",
            prompt="Extract named entities from {{ input.text }}",
            output_keys=["entities"],
        ),
        MapOpConfig(
            name="classify_entities",
            prompt="Classify these entities: {{ input.entities }}",
            output_keys=["classified_entities"],
        ),
    ]
)

# Validate the chain covers required input/output keys
ChainingInstantiateSchema.validate_chain(
    chain.new_ops,
    required_input_keys=["text"],
    expected_output_keys=["classified_entities"],
)
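The idea of a chain covering its required input and output keys can be pictured roughly as follows. This is a sketch under assumed rules (every required input key is referenced by some prompt, and the last operation produces every expected output key). It does not describe the behavior of `validate_chain` itself. `OpSketch` and `check_chain_coverage` are hypothetical names, with `OpSketch` standing in for `MapOpConfig` so the example runs without DocETL installed.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for {{ input.<key> }} references.
INPUT_REF = re.compile(r"\{\{\s*input\.([A-Za-z_][A-Za-z0-9_]*)")


@dataclass
class OpSketch:
    # Stand-in for MapOpConfig with the same three fields.
    name: str
    prompt: str
    output_keys: list


def check_chain_coverage(ops, required_input_keys, expected_output_keys):
    """Raise ValueError if the chain misses an input or output key (assumed rules)."""
    if not ops:
        raise ValueError("Chain must contain at least one operation")
    # Collect every input key referenced by any prompt in the chain.
    referenced = set()
    for op in ops:
        referenced |= set(INPUT_REF.findall(op.prompt))
    missing_inputs = set(required_input_keys) - referenced
    if missing_inputs:
        raise ValueError(f"No prompt references input keys: {sorted(missing_inputs)}")
    # Only the final operation's outputs are compared with the expected keys.
    missing_outputs = set(expected_output_keys) - set(ops[-1].output_keys)
    if missing_outputs:
        raise ValueError(f"Last op does not produce: {sorted(missing_outputs)}")
```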