Implementation:Ucbepic Docetl ExtractOperation Execute

From Leeroopedia


Knowledge Sources
Domains: Data_Processing, Text_Extraction
Last Updated: 2026-02-08 00:00 GMT

Overview

ExtractOperation is a concrete tool, provided by DocETL, for extracting specific text sections from documents using LLM-guided identification strategies.

Description

The ExtractOperation class extends BaseOperation to perform structured text extraction from unstructured documents. It supports two extraction strategies: a "line_number" strategy that reformats text with line numbers and asks the LLM to return start/end line ranges, and a "regex" strategy that asks the LLM to generate regex patterns matching the desired sections. Results are deduplicated and stored as new keys on the output documents with a configurable suffix.
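The "line_number" strategy described above can be illustrated with a minimal sketch. This is not DocETL's implementation: the helper names `number_lines` and `extract_ranges`, the `N: text` numbering format, and the hard-coded range standing in for the LLM's answer are all assumptions made for illustration.

```python
import textwrap


def number_lines(text: str, line_width: int = 80) -> tuple[str, list[str]]:
    """Wrap text to line_width and prefix each line with a 1-based number."""
    lines: list[str] = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty string, so keep blank lines explicitly.
        lines.extend(textwrap.wrap(paragraph, width=line_width) or [""])
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))
    return numbered, lines


def extract_ranges(lines: list[str], ranges: list[tuple[int, int]]) -> list[str]:
    """Return the text covered by each inclusive (start, end) line range."""
    sections = []
    for start, end in ranges:
        # Clamp to the document so an out-of-range answer cannot raise.
        start, end = max(1, start), min(len(lines), end)
        if start <= end:
            sections.append("\n".join(lines[start - 1:end]))
    return sections


doc = "Intro line.\nMethods: we sampled 10 users.\nWe ran two trials.\nConclusion."
numbered, lines = number_lines(doc)  # `numbered` is what the LLM would be shown

# Hypothetical LLM answer: the section of interest spans lines 2 through 3.
sections = extract_ranges(lines, [(2, 3)])
```

Asking for line ranges rather than verbatim text keeps the LLM's output short and means the extracted text is copied from the source document rather than regenerated.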

Usage

Use this operation when you need to identify and extract specific sections from large documents, such as pulling out method sections from research papers, extracting specific clauses from legal contracts, or isolating relevant paragraphs from long-form content for downstream processing.

Code Reference

Source Location

Signature

class ExtractOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "extract"
        prompt: str
        document_keys: list[str] = Field(..., min_items=1)
        model: str | None = None
        format_extraction: bool = True
        extraction_key_suffix: str | None = None
        extraction_method: Literal["line_number", "regex"] = "line_number"
        timeout: int | None = None
        skip_on_error: bool = False
        litellm_completion_kwargs: dict[str, Any] = Field(default_factory=dict)
        limit: int | None = Field(None, gt=0)

    def _reformat_text_with_line_numbers(self, text: str, line_width: int = 80) -> str: ...
    def _execute_line_number_strategy(self, item: dict, doc_key: str) -> tuple[list[str], float, str]: ...
    def _execute_regex_strategy(self, item: dict, doc_key: str) -> tuple[list[str], float, str]: ...
    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]: ...
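As a rough illustration of what a regex strategy followed by deduplication might involve, the sketch below applies a list of patterns to a document and keeps each distinct match once, in order of first appearance. The function name `apply_patterns`, the use of `re.MULTILINE`, and the policy of skipping invalid patterns are assumptions, not DocETL's actual behavior; the pattern list stands in for what an LLM would generate.

```python
import re


def apply_patterns(text: str, patterns: list[str]) -> list[str]:
    """Collect unique matches for each pattern, preserving first-seen order."""
    seen: set[str] = set()
    results: list[str] = []
    for pattern in patterns:
        try:
            compiled = re.compile(pattern, re.MULTILINE)
        except re.error:
            # Assumption: an invalid LLM-generated pattern is skipped, not fatal.
            continue
        for match in compiled.finditer(text):
            snippet = match.group(0).strip()
            if snippet and snippet not in seen:
                seen.add(snippet)
                results.append(snippet)
    return results


contract = (
    "Clause 1: Payment is due in 30 days.\n"
    "Penalty: Late payment incurs a 5% fee.\n"
    "Clause 2: Either party may terminate.\n"
    "Penalty: Early termination costs $500.\n"
)

# Hypothetical LLM output: two overlapping patterns and one malformed pattern.
patterns = [r"^Penalty:.*$", r"^Penalty: Late.*$", r"(unclosed"]
clauses = apply_patterns(contract, patterns)
```

Here the second pattern re-matches a line already captured by the first, so deduplication leaves two clauses rather than three.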

Import

from docetl.operations.extract import ExtractOperation

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Input documents containing text fields to extract from |
| prompt | str | Yes | Jinja2 template prompt describing what to extract |
| document_keys | List[str] | Yes | Keys in the input documents that contain the text to process |
| extraction_method | str | No | Strategy to use: "line_number" (default) or "regex" |
| format_extraction | bool | No | Whether to join extracted texts with newlines (default True) or return as list |
| extraction_key_suffix | str | No | Suffix for output keys (default: "_extracted_{operation_name}") |
| model | str | No | LLM model to use (defaults to pipeline default) |
| skip_on_error | bool | No | Whether to skip errors gracefully (default False) |
| limit | int | No | Maximum number of input documents to process |

Outputs

| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Documents with new extraction keys added and total cost |

Usage Examples

# YAML pipeline configuration for extraction
operations:
  - name: extract_methods
    type: extract
    prompt: |
      Extract the methodology section from this research paper.
    document_keys:
      - paper_text
    extraction_method: line_number
    format_extraction: true
    model: "gpt-4o-mini"

# Python API usage
# Note: `runner`, `default_model`, `max_threads`, and `input_data` are assumed to be
# defined by the surrounding pipeline code; this snippet does not create them.
from docetl.operations.extract import ExtractOperation

config = {
    "name": "extract_clauses",
    "type": "extract",
    "prompt": "Extract all penalty clauses from this contract: {{ input.contract_text }}",
    "document_keys": ["contract_text"],
    "extraction_method": "regex",
    "format_extraction": False,  # keep the extracted sections as a list
}
extract_op = ExtractOperation(runner, config, default_model, max_threads)
# execute() returns the output documents and the total cost
results, cost = extract_op.execute(input_data)
# Each result has a new key "contract_text_extracted_extract_clauses"
# (document key + default suffix "_extracted_{operation_name}")
