Implementation:Ucbepic Docetl InstantiateSchemas

From Leeroopedia


Knowledge Sources
Domains: Data_Processing, Optimization, Schema_Validation
Last Updated: 2026-02-08 00:00 GMT

Overview

A set of Pydantic schema classes, provided by DocETL, that validate and structure the instantiations of optimizer directives.

Description

The instantiate_schemas module defines a comprehensive set of Pydantic BaseModel classes used to validate the structured outputs produced by an LLM when instantiating optimizer directives. Each schema class corresponds to a specific rewrite directive (e.g., chaining, gleaning, model changes, document summarization, subtask isolation, document compression, operator fusion, chunking). The schemas enforce that LLM-generated prompts contain correct Jinja2 template references, output keys match expectations, and code blocks define required functions. Helper functions extract_for_variable and create_dynamic_pattern support template variable extraction for reduce operations.
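To make the Jinja2 reference check concrete, here is a minimal sketch using only the standard library. The regular expression and the function names `referenced_input_keys` and `check_map_prompt` are hypothetical illustrations, not the module's actual validators, which are Pydantic-based and not reproduced on this page.

```python
import re

# Hypothetical pattern: matches {{ input.some_key }} with optional whitespace.
INPUT_REF = re.compile(r"\{\{\s*input\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def referenced_input_keys(prompt: str) -> set[str]:
    """Return the keys referenced as {{ input.<key> }} in a prompt."""
    return set(INPUT_REF.findall(prompt))


def check_map_prompt(prompt: str) -> str:
    """Raise ValueError if the prompt has no {{ input.<key> }} reference."""
    if not referenced_input_keys(prompt):
        raise ValueError(
            "map prompt must reference at least one {{ input.<key> }} variable"
        )
    return prompt
```

A check of this kind could plausibly run inside a Pydantic field validator, so that a prompt without any input reference is rejected before a rewrite is applied.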

Usage

Use these schemas when the reasoning optimizer or MOAR search generates new pipeline configurations from directives. They serve as structured output formats for LLM calls and validation layers before applying rewrites to pipeline YAML.

Code Reference

Source Location

Signature

def extract_for_variable(template_content) -> str | None: ...
def create_dynamic_pattern(new_key, template_content) -> tuple[str, str | None]: ...

class MapOpConfig(BaseModel):
    name: str
    prompt: str
    output_keys: List[str]

class ChainingInstantiateSchema(BaseModel):
    new_ops: List[MapOpConfig]

class GleaningInstantiateSchema(BaseModel):
    validation_prompt: str
    num_rounds: int
    model: str = "gpt-4o-mini"

class ChangeModelInstantiateSchema(BaseModel):
    model: str = "gpt-4o-mini"

class DocSummarizationInstantiateSchema(BaseModel):
    name: str
    document_key: str
    prompt: str
    model: str = "gpt-4o-mini"

class SubtaskConfig(BaseModel):
    name: str
    prompt: str
    output_keys: List[str]

class IsolatingSubtasksInstantiateSchema(BaseModel):
    subtasks: List[SubtaskConfig]
    aggregation_prompt: str = ""

class DocCompressionInstantiateSchema(BaseModel):
    name: str
    document_key: str
    prompt: str
    model: str = "gpt-4o-mini"

class DeterministicDocCompressionInstantiateSchema(BaseModel):
    name: str
    code: str

class OperatorFusionInstantiateSchema(BaseModel):
    fused_prompt: str

class ChunkSubsectionConfig(BaseModel):
    count: int
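The bodies of the two helper functions are not shown above. As a rough illustration of what extracting a loop variable from a reduce template might involve, the sketch below assumes reduce prompts iterate with `{% for <var> in inputs %}`. The name `sketch_extract_for_variable` and the regular expression are assumptions made for this example and may differ from the real `extract_for_variable`.

```python
from __future__ import annotations

import re

# Assumed loop shape in a reduce prompt: {% for item in inputs %}
# (optional "-" whitespace-control markers are tolerated).
FOR_LOOP = re.compile(r"\{%-?\s*for\s+(\w+)\s+in\s+inputs\s*-?%\}")


def sketch_extract_for_variable(template_content: str) -> str | None:
    """Return the loop variable of the first `for ... in inputs` block, if any."""
    match = FOR_LOOP.search(template_content)
    return match.group(1) if match else None
```

Returning `None` when no loop is found mirrors the `str | None` return type in the signature above and lets the caller decide how to handle templates without a loop.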

Import

from docetl.reasoning_optimizer.instantiate_schemas import (
    MapOpConfig,
    ChainingInstantiateSchema,
    GleaningInstantiateSchema,
    ChangeModelInstantiateSchema,
    DocSummarizationInstantiateSchema,
    IsolatingSubtasksInstantiateSchema,
    DocCompressionInstantiateSchema,
    DeterministicDocCompressionInstantiateSchema,
    OperatorFusionInstantiateSchema,
)

I/O Contract

Inputs

Name | Type | Required | Description
name | str | Yes | Name of the operator or directive being instantiated
prompt | str | Yes | Jinja2 template prompt (must contain {{ input.key }} references for map operations)
output_keys | List[str] | Yes | Keys produced by the operation, referenced downstream
validation_prompt | str | Yes | Prompt for gleaning validation (must NOT contain Jinja variables)
model | str | No | LLM model name (default: "gpt-4o-mini")
code | str | Yes | Python code with a transform function (for deterministic compression)
document_key | str | Yes | Input key containing long content to summarize or compress
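Two of the constraints in this table lend themselves to a short illustration: the gleaning validation prompt must contain no Jinja variables, and the deterministic compression code must define a transform function. The sketch below uses only the standard library. The names `check_validation_prompt` and `defines_function` are hypothetical, and the module's own checks may be stricter or differ in detail.

```python
import ast
import re

# Any {{ ... }} expression or {% ... %} statement counts as Jinja syntax here.
JINJA_MARKER = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)


def check_validation_prompt(prompt: str) -> str:
    """Raise ValueError if a gleaning validation prompt contains Jinja syntax."""
    if JINJA_MARKER.search(prompt):
        raise ValueError("validation_prompt must not contain Jinja variables")
    return prompt


def defines_function(code: str, func_name: str = "transform") -> bool:
    """Return True if `code` parses and defines a top-level function `func_name`."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == func_name
        for node in tree.body
    )
```

Parsing with `ast` rather than searching for the string `def transform` avoids false positives from comments or string literals and never executes the LLM-generated code.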

Outputs

Name | Type | Description
validated_schema | BaseModel | A validated Pydantic model instance ready for directive instantiation
validation_errors | ValueError | Raised when prompts, keys, or code do not meet constraints

Usage Examples

from docetl.reasoning_optimizer.instantiate_schemas import (
    ChainingInstantiateSchema,
    MapOpConfig,
)

# Validate a chaining directive with two map operations
chain = ChainingInstantiateSchema(
    new_ops=[
        MapOpConfig(
            name="extract_entities",
            prompt="Extract named entities from {{ input.text }}",
            output_keys=["entities"],
        ),
        MapOpConfig(
            name="classify_entities",
            prompt="Classify these entities: {{ input.entities }}",
            output_keys=["classified_entities"],
        ),
    ]
)

# Validate the chain covers required input/output keys
ChainingInstantiateSchema.validate_chain(
    chain.new_ops,
    required_input_keys=["text"],
    expected_output_keys=["classified_entities"],
)
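The internals of `validate_chain` are not shown on this page. The sketch below illustrates one way such a coverage check could work, under two assumptions: every required input key must be referenced by some prompt in the chain, and the last operation must produce every expected output key. It uses plain dictionaries instead of `MapOpConfig` so that it runs without DocETL installed. The name `check_chain` is hypothetical.

```python
import re

# Matches {{ input.<key> }} references in a prompt.
INPUT_REF = re.compile(r"\{\{\s*input\.(\w+)\s*\}\}")


def check_chain(ops, required_input_keys, expected_output_keys):
    """Check a chain of ops, each a dict with 'prompt' and 'output_keys'.

    Raises ValueError if a required input key is never referenced, or if
    the last op does not produce all expected output keys.
    """
    if not ops:
        raise ValueError("chain must contain at least one operation")

    # Collect every input key referenced anywhere in the chain.
    referenced = set()
    for op in ops:
        referenced |= set(INPUT_REF.findall(op["prompt"]))

    missing_inputs = set(required_input_keys) - referenced
    if missing_inputs:
        raise ValueError(f"chain never references: {sorted(missing_inputs)}")

    # The final op is assumed to carry the outputs used downstream.
    missing_outputs = set(expected_output_keys) - set(ops[-1]["output_keys"])
    if missing_outputs:
        raise ValueError(f"last op does not produce: {sorted(missing_outputs)}")


ops = [
    {"prompt": "Extract named entities from {{ input.text }}",
     "output_keys": ["entities"]},
    {"prompt": "Classify these entities: {{ input.entities }}",
     "output_keys": ["classified_entities"]},
]
check_chain(ops, required_input_keys=["text"],
            expected_output_keys=["classified_entities"])
```

The call at the end mirrors the two-step chain from the example above and passes silently. Requesting an output key that the last operation does not produce would raise `ValueError`.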
