Implementation: Ucbepic Docetl ExtractOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Text_Extraction |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for extracting specific text sections from documents using LLM-guided identification strategies, provided by DocETL.
Description
The ExtractOperation class extends BaseOperation to perform structured text extraction from unstructured documents. It supports two extraction strategies: a "line_number" strategy that reformats text with line numbers and asks the LLM to return start/end line ranges, and a "regex" strategy that asks the LLM to generate regex patterns matching the desired sections. Results are deduplicated and stored as new keys on the output documents with a configurable suffix.
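The "line_number" strategy can be illustrated with a minimal sketch. The helper names below are hypothetical, and the real operation sends the numbered text to an LLM and parses the start/end ranges from its response; here the ranges are supplied directly:

```python
import textwrap

def reformat_with_line_numbers(text: str, line_width: int = 80) -> list[str]:
    # Wrap the raw text to a fixed width, then prefix each line with its
    # 1-based number so an LLM can cite start/end ranges unambiguously.
    wrapped: list[str] = []
    for paragraph in text.split("\n"):
        wrapped.extend(textwrap.wrap(paragraph, width=line_width) or [""])
    return [f"{i + 1}: {line}" for i, line in enumerate(wrapped)]

def extract_ranges(numbered: list[str], ranges: list[tuple[int, int]]) -> list[str]:
    # Slice the numbered lines by each (start, end) range, stripping the
    # "N: " prefix before joining each section back together.
    sections = []
    for start, end in ranges:
        lines = numbered[start - 1:end]
        sections.append("\n".join(line.split(": ", 1)[1] for line in lines))
    return sections

doc = "Intro paragraph.\nMethods: we did X.\nResults: Y happened."
numbered = reformat_with_line_numbers(doc)
sections = extract_ranges(numbered, [(2, 2)])  # as if the LLM returned (2, 2)
```

Numbering the lines turns a fuzzy "find the methods section" task into a verifiable range lookup, which is why this is the default strategy.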
Usage
Use this operation when you need to identify and extract specific sections from large documents, such as pulling out method sections from research papers, extracting specific clauses from legal contracts, or isolating relevant paragraphs from long-form content for downstream processing.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/extract.py
- Lines: 1-518
Signature
class ExtractOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "extract"
        prompt: str
        document_keys: list[str] = Field(..., min_items=1)
        model: str | None = None
        format_extraction: bool = True
        extraction_key_suffix: str | None = None
        extraction_method: Literal["line_number", "regex"] = "line_number"
        timeout: int | None = None
        skip_on_error: bool = False
        litellm_completion_kwargs: dict[str, Any] = Field(default_factory=dict)
        limit: int | None = Field(None, gt=0)

    def _reformat_text_with_line_numbers(self, text: str, line_width: int = 80) -> str: ...
    def _execute_line_number_strategy(self, item: dict, doc_key: str) -> tuple[list[str], float, str]: ...
    def _execute_regex_strategy(self, item: dict, doc_key: str) -> tuple[list[str], float, str]: ...
    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]: ...
Import
from docetl.operations.extract import ExtractOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Input documents containing text fields to extract from |
| prompt | str | Yes | Jinja2 template prompt describing what to extract |
| document_keys | List[str] | Yes | Keys in the input documents that contain the text to process |
| extraction_method | str | No | Strategy to use: "line_number" (default) or "regex" |
| format_extraction | bool | No | Whether to join extracted texts with newlines (default True) or return as list |
| extraction_key_suffix | str | No | Suffix for output keys (default: "_extracted_{operation_name}") |
| model | str | No | LLM model to use (defaults to pipeline default) |
| skip_on_error | bool | No | Whether to skip errors gracefully (default False) |
| limit | int | No | Maximum number of input documents to process |
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Documents with new extraction keys added and total cost |
Usage Examples
# YAML pipeline configuration for extraction
operations:
  - name: extract_methods
    type: extract
    prompt: |
      Extract the methodology section from this research paper.
    document_keys:
      - paper_text
    extraction_method: line_number
    format_extraction: true
    model: "gpt-4o-mini"
# Python API usage
from docetl.operations.extract import ExtractOperation

config = {
    "name": "extract_clauses",
    "type": "extract",
    "prompt": "Extract all penalty clauses from this contract: {{ input.contract_text }}",
    "document_keys": ["contract_text"],
    "extraction_method": "regex",
    "format_extraction": False,
}

# runner, default_model, and max_threads come from the surrounding pipeline setup
extract_op = ExtractOperation(runner, config, default_model, max_threads)
results, cost = extract_op.execute(input_data)
# format_extraction is False, so each result stores the extracted clauses as a
# list under the new key "contract_text_extracted_extract_clauses"
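The post-processing side of the "regex" strategy can be sketched like this. The `apply_patterns` helper and its skip-invalid-pattern behavior are assumptions; in the real operation the patterns come from the LLM's response:

```python
import re

def apply_patterns(text: str, patterns: list[str]) -> list[str]:
    # Run each LLM-proposed pattern over the document. An invalid pattern is
    # skipped rather than failing the whole extraction (hedged assumption).
    matches: list[str] = []
    for pattern in patterns:
        try:
            matches.extend(m.group(0) for m in re.finditer(pattern, text, re.DOTALL))
        except re.error:
            continue
    # Deduplicate while preserving first-seen order, mirroring the documented
    # deduplication of results.
    seen: set[str] = set()
    return [m for m in matches if not (m in seen or seen.add(m))]

text = "Penalty: $100 late fee. Terms apply. Penalty: $100 late fee."
clauses = apply_patterns(text, [r"Penalty:[^.]*\.", r"bad(regex"])
```

Here the repeated clause collapses to one entry and the malformed second pattern is ignored, leaving a single extracted snippet.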