Implementation: Ucbepic Docetl ExtractOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Text_Extraction |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for extracting specific text sections from documents using LLM-guided identification strategies, provided by DocETL.
Description
The ExtractOperation class extends BaseOperation to perform structured text extraction from unstructured documents. It supports two extraction strategies: a "line_number" strategy that reformats text with line numbers and asks the LLM to return start/end line ranges, and a "regex" strategy that asks the LLM to generate regex patterns matching the desired sections. Results are deduplicated and stored as new keys on the output documents with a configurable suffix.
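The "line_number" strategy can be illustrated with a minimal sketch. The helper names below are hypothetical, and the real operation sends the numbered text to an LLM and parses the start/end ranges from its response; here the ranges are supplied directly:

```python
import textwrap

def reformat_with_line_numbers(text: str, line_width: int = 80) -> list[str]:
    # Wrap the raw text to a fixed width, then prefix each line with its
    # 1-based number so an LLM can cite start/end ranges unambiguously.
    wrapped: list[str] = []
    for paragraph in text.split("\n"):
        wrapped.extend(textwrap.wrap(paragraph, width=line_width) or [""])
    return [f"{i + 1}: {line}" for i, line in enumerate(wrapped)]

def extract_ranges(numbered: list[str], ranges: list[tuple[int, int]]) -> list[str]:
    # Slice the numbered lines by each (start, end) range, stripping the
    # "N: " prefix before joining each section back together.
    sections = []
    for start, end in ranges:
        lines = numbered[start - 1:end]
        sections.append("\n".join(line.split(": ", 1)[1] for line in lines))
    return sections

doc = "Intro paragraph.\nMethods: we did X.\nResults: Y happened."
numbered = reformat_with_line_numbers(doc)
sections = extract_ranges(numbered, [(2, 2)])  # as if the LLM returned (2, 2)
```

Numbering the lines turns a fuzzy "find the methods section" task into a verifiable range lookup, which is why this is the default strategy.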
Usage
Use this operation when you need to identify and extract specific sections from large documents, such as pulling out method sections from research papers, extracting specific clauses from legal contracts, or isolating relevant paragraphs from long-form content for downstream processing.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/extract.py
- Lines: 1-518
Signature
class ExtractOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "extract"
        prompt: str
        document_keys: list[str] = Field(..., min_items=1)
        model: str | None = None
        format_extraction: bool = True
        extraction_key_suffix: str | None = None
        extraction_method: Literal["line_number", "regex"] = "line_number"
        timeout: int | None = None
        skip_on_error: bool = False
        litellm_completion_kwargs: dict[str, Any] = Field(default_factory=dict)
        limit: int | None = Field(None, gt=0)

    def _reformat_text_with_line_numbers(self, text: str, line_width: int = 80) -> str: ...
    def _execute_line_number_strategy(self, item: dict, doc_key: str) -> tuple[list[str], float, str]: ...
    def _execute_regex_strategy(self, item: dict, doc_key: str) -> tuple[list[str], float, str]: ...
    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]: ...
Import
from docetl.operations.extract import ExtractOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Input documents containing text fields to extract from |
| prompt | str | Yes | Jinja2 template prompt describing what to extract |
| document_keys | List[str] | Yes | Keys in the input documents that contain the text to process |
| extraction_method | str | No | Strategy to use: "line_number" (default) or "regex" |
| format_extraction | bool | No | Whether to join extracted texts with newlines (default True) or return as list |
| extraction_key_suffix | str | No | Suffix for output keys (default: "_extracted_{operation_name}") |
| model | str | No | LLM model to use (defaults to pipeline default) |
| skip_on_error | bool | No | Whether to skip errors gracefully (default False) |
| limit | int | No | Maximum number of input documents to process |
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Documents with new extraction keys added and total cost |
Usage Examples
# YAML pipeline configuration for extraction
operations:
  - name: extract_methods
    type: extract
    prompt: |
      Extract the methodology section from this research paper.
    document_keys:
      - paper_text
    extraction_method: line_number
    format_extraction: true
    model: "gpt-4o-mini"
# Python API usage
from docetl.operations.extract import ExtractOperation

config = {
    "name": "extract_clauses",
    "type": "extract",
    "prompt": "Extract all penalty clauses from this contract: {{ input.contract_text }}",
    "document_keys": ["contract_text"],
    "extraction_method": "regex",
    "format_extraction": False,
}

# runner, default_model, and max_threads come from the surrounding pipeline setup
extract_op = ExtractOperation(runner, config, default_model, max_threads)
results, cost = extract_op.execute(input_data)
# format_extraction is False, so each result stores the extracted clauses as a
# list under the new key "contract_text_extracted_extract_clauses"
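The post-processing side of the "regex" strategy can be sketched like this. The `apply_patterns` helper and its skip-invalid-pattern behavior are assumptions; in the real operation the patterns come from the LLM's response:

```python
import re

def apply_patterns(text: str, patterns: list[str]) -> list[str]:
    # Run each LLM-proposed pattern over the document. An invalid pattern is
    # skipped rather than failing the whole extraction (hedged assumption).
    matches: list[str] = []
    for pattern in patterns:
        try:
            matches.extend(m.group(0) for m in re.finditer(pattern, text, re.DOTALL))
        except re.error:
            continue
    # Deduplicate while preserving first-seen order, mirroring the documented
    # deduplication of results.
    seen: set[str] = set()
    return [m for m in matches if not (m in seen or seen.add(m))]

text = "Penalty: $100 late fee. Terms apply. Penalty: $100 late fee."
clauses = apply_patterns(text, [r"Penalty:[^.]*\.", r"bad(regex"])
```

Here the repeated clause collapses to one entry and the malformed second pattern is ignored, leaving a single extracted snippet.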