Implementation:Ucbepic Docetl FilterOperation Execute

Knowledge Sources	Ucbepic_Docetl DocETL Docs
Domains	Data_Processing, Semantic_Filtering
Last Updated	2026-02-08 00:00 GMT

Overview

A concrete DocETL tool that filters documents by having an LLM evaluate each one against a user-defined prompt and return a boolean verdict.

Description

The FilterOperation class extends MapOperation to provide LLM-powered semantic filtering. It reuses the map operation's LLM execution logic but requires the output schema to contain exactly one boolean field. The LLM evaluates each document against the prompt and returns a boolean judgment, and documents judged False are dropped from the output. At 79 lines, the subclass is lightweight and delegates most of its logic to its parent class.
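
The filtering behaviour described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DocETL's actual code: `single_bool_key`, `semantic_filter`, and `keyword_judge` are hypothetical names, and the keyword-based judge stands in for the LLM call.

```python
from typing import Any, Callable


def single_bool_key(schema: dict[str, str]) -> str:
    """Return the only key of the schema, insisting that it is boolean."""
    if len(schema) != 1:
        raise ValueError("filter output schema must contain exactly one key")
    key, type_name = next(iter(schema.items()))
    if type_name.strip().lower() not in {"bool", "boolean"}:
        raise ValueError(f"filter key {key!r} must be boolean, got {type_name!r}")
    return key


def semantic_filter(
    docs: list[dict[str, Any]],
    schema: dict[str, str],
    judge: Callable[[dict[str, Any]], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Keep only the documents whose judged boolean field is True."""
    key = single_bool_key(schema)
    kept = []
    for doc in docs:
        verdict = judge(doc)  # stand-in for the per-document LLM evaluation
        if verdict.get(key) is True:
            kept.append(doc)
    return kept


def keyword_judge(doc: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical judge: a keyword test in place of a real LLM call."""
    return {"is_relevant": "climate" in doc["title"].lower()}


docs = [{"title": "Climate models in 2020"}, {"title": "Sourdough basics"}]
relevant = semantic_filter(docs, {"is_relevant": "bool"}, keyword_judge)
```

Here `relevant` holds only the first document, since the stand-in judge returns False for the second.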

Usage

Use this operation when you need to filter documents based on complex semantic criteria that cannot be expressed as simple programmatic rules. Typical scenarios include removing irrelevant documents from a corpus, filtering out low-quality entries, selecting documents that match specific topical criteria, or enforcing content policies.

Code Reference

Source Location

Repository: Ucbepic_Docetl
File: docetl/operations/filter.py
Lines: 1-79

Signature

class FilterOperation(MapOperation):
    class schema(MapOperation.schema):
        type: str = "filter"
        prompt: str
        output: dict[str, Any]

    def __init__(self, *args, **kwargs): ...

    def _limit_applies_to_inputs(self) -> bool: ...

    def _handle_result(self, result: dict[str, Any]) -> tuple[dict | None, bool]: ...

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...

Import

from docetl.operations.filter import FilterOperation

I/O Contract

Inputs

Name	Type	Required	Description
input_data	List[Dict]	Yes	Documents to filter
prompt	str	Yes	Jinja2 template prompt for LLM evaluation of each document
output	Dict	Yes	Output configuration with a schema containing exactly one boolean key
is_build	bool	No	Whether the operation is in build phase (if True, keeps all documents)

Outputs

Name	Type	Description
output	Tuple[List[Dict], float]	Filtered documents (only those where LLM returned True) and total cost
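
The contract in the tables above can be illustrated with a small stand-alone sketch. It assumes a hypothetical `judge` callable that returns a boolean verdict plus a per-call cost, and a hypothetical `execute_filter` function. It mirrors only what the tables state (a tuple of kept documents and total cost, with `is_build=True` keeping every document) and makes no claim about how the real `execute` method is written.

```python
from typing import Any, Callable

# Hypothetical judge type: returns (verdict, cost of the call).
Judge = Callable[[dict[str, Any]], tuple[bool, float]]


def execute_filter(
    input_data: list[dict[str, Any]],
    judge: Judge,
    is_build: bool = False,
) -> tuple[list[dict[str, Any]], float]:
    """Return (kept documents, total cost), following the I/O tables."""
    kept: list[dict[str, Any]] = []
    total_cost = 0.0
    for doc in input_data:
        verdict, cost = judge(doc)
        total_cost += cost  # every evaluation is paid for, kept or not
        if is_build or verdict:  # the build phase keeps all documents
            kept.append(doc)
    return kept, total_cost


def length_judge(doc: dict[str, Any]) -> tuple[bool, float]:
    """Hypothetical judge: passes texts longer than 20 characters at a made-up cost."""
    return len(doc["content"]) > 20, 0.25


sample = [
    {"content": "A long, coherent paragraph of text."},
    {"content": "too short"},
]
filtered, cost = execute_filter(sample, length_judge)
everything, build_cost = execute_filter(sample, length_judge, is_build=True)
```

In this sketch the cost is the same in both modes, because every document is evaluated regardless of whether it is kept.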

Usage Examples

# YAML pipeline configuration for filtering
operations:
  - name: filter_relevant
    type: filter
    prompt: |
      Determine if this document is relevant to climate change research.
      Document: {{ input.title }} - {{ input.abstract }}
    output:
      schema:
        is_relevant: bool
    model: "gpt-4o-mini"

# Python API usage
from docetl.operations.filter import FilterOperation

config = {
    "name": "quality_filter",
    "type": "filter",
    "prompt": "Is this text well-written and coherent? Text: {{ input.content }}",
    "output": {
        "schema": {
            "is_quality": "bool",
        }
    },
}
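# runner, default_model, max_threads, and input_data are assumed to be
# defined already by the surrounding pipeline code.
filter_op = FilterOperation(runner, config, default_model, max_threads)
filtered_results, cost = filter_op.execute(input_data)
# Only documents that the LLM judged is_quality=True are returned.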

Related Pages

Principle:Ucbepic_Docetl_LLM_Powered_Semantic_Filtering
