Implementation:Ucbepic Docetl FilterOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Semantic_Filtering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for filtering documents using LLM-based boolean evaluation against a user-defined prompt, provided by DocETL.
Description
The FilterOperation class extends MapOperation to provide LLM-powered semantic filtering. It reuses the map operation's LLM execution logic but enforces that the output schema contains exactly one boolean field. The LLM evaluates each document against the prompt and returns a boolean judgment; documents where the judgment is False are removed from the output. This is a lightweight 79-line subclass that delegates most of its logic to its parent class.
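The "exactly one boolean field" rule can be illustrated with a small standalone check. This is a minimal sketch and not DocETL's own validation code: the helper name `find_filter_key` and the accepted type spellings (`bool`, `boolean`) are assumptions made for illustration.

```python
def find_filter_key(output_schema: dict[str, str]) -> str:
    """Return the single boolean key of a filter schema, or raise if the schema is invalid.

    Hypothetical helper; DocETL's real validation may differ in details.
    """
    if len(output_schema) != 1:
        raise ValueError(
            f"filter schema must contain exactly one key, got {len(output_schema)}"
        )
    key, type_name = next(iter(output_schema.items()))
    # Assumed spellings of the boolean type in a schema declaration.
    if type_name.strip().lower() not in {"bool", "boolean"}:
        raise TypeError(f"filter key {key!r} must be boolean, got {type_name!r}")
    return key


filter_key = find_filter_key({"is_relevant": "bool"})
```

A schema with two keys, or with a single non-boolean key, is rejected before any document is evaluated, so a misconfigured filter fails early rather than after LLM calls have been made.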
Usage
Use this operation when you need to filter documents based on complex semantic criteria that cannot be expressed as simple programmatic rules. Typical scenarios include removing irrelevant documents from a corpus, filtering out low-quality entries, selecting documents that match specific topical criteria, or enforcing content policies.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/filter.py
- Lines: 1-79
Signature
class FilterOperation(MapOperation):
    class schema(MapOperation.schema):
        type: str = "filter"
        prompt: str
        output: dict[str, Any]

    def __init__(self, *args, **kwargs): ...
    def _limit_applies_to_inputs(self) -> bool: ...
    def _handle_result(self, result: dict[str, Any]) -> tuple[dict | None, bool]: ...
    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...
Import
from docetl.operations.filter import FilterOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Documents to filter |
| prompt | str | Yes | Jinja2 template prompt for LLM evaluation of each document |
| output | Dict | Yes | Output configuration with a schema containing exactly one boolean key |
| is_build | bool | No | Whether the operation is in build phase (if True, keeps all documents) |
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Filtered documents (only those where LLM returned True) and total cost |
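The shape of this contract (a list of documents in, a tuple of kept documents and total cost out) can be sketched without an LLM. The code below is an illustrative stand-in, not the DocETL implementation: `run_filter`, `word_count_judge`, and the fixed per-call cost of 0.25 are all hypothetical, and the judge replaces the real LLM call with a trivial rule.

```python
from typing import Any, Callable

# A judge maps one document to (structured result, cost of that call).
Judge = Callable[[dict[str, Any]], tuple[dict[str, Any], float]]


def run_filter(
    input_data: list[dict[str, Any]], filter_key: str, judge: Judge
) -> tuple[list[dict[str, Any]], float]:
    """Keep documents whose judged boolean is True and sum the per-call costs."""
    kept: list[dict[str, Any]] = []
    total_cost = 0.0
    for doc in input_data:
        result, cost = judge(doc)
        total_cost += cost  # in this sketch, dropped documents still add to the cost
        if result.get(filter_key) is True:
            kept.append(doc)
    return kept, total_cost


def word_count_judge(doc: dict[str, Any]) -> tuple[dict[str, Any], float]:
    """Stand-in for the LLM: 'quality' means at least three words; cost is a made-up constant."""
    return {"is_quality": len(doc["content"].split()) >= 3}, 0.25


docs = [
    {"content": "A clear and complete sentence."},
    {"content": "meh"},
    {"content": "Another well formed entry here."},
]
filtered, cost = run_filter(docs, "is_quality", word_count_judge)
```

Here `filtered` holds the first and third documents, and `cost` is the sum over all three judge calls, mirroring the `(documents, cost)` tuple described in the Outputs table.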
Usage Examples
# YAML pipeline configuration for filtering
operations:
  - name: filter_relevant
    type: filter
    prompt: |
      Determine if this document is relevant to climate change research.
      Document: {{ input.title }} - {{ input.abstract }}
    output:
      schema:
        is_relevant: bool
    model: "gpt-4o-mini"
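# Python API usage
# Note: `runner`, `default_model`, `max_threads`, and `input_data` are not
# defined in this snippet; they are assumed to come from the surrounding
# pipeline setup.
from docetl.operations.filter import FilterOperation

config = {
    "name": "quality_filter",
    "type": "filter",
    "prompt": "Is this text well-written and coherent? Text: {{ input.content }}",
    "output": {
        "schema": {
            "is_quality": "bool",
        }
    },
}

filter_op = FilterOperation(runner, config, default_model, max_threads)

# execute() returns the filtered documents and the total LLM cost
filtered_results, cost = filter_op.execute(input_data)
# Only documents where the LLM judged is_quality=True are returned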