Implementation:Ucbepic Docetl FilterOperation Execute

From Leeroopedia


Knowledge Sources
Domains: Data_Processing, Semantic_Filtering
Last Updated: 2026-02-08 00:00 GMT

Overview

Concrete tool for filtering documents using LLM-based boolean evaluation against a user-defined prompt, provided by DocETL.

Description

The FilterOperation class extends MapOperation to provide LLM-powered semantic filtering. It reuses the map operation's LLM execution logic but requires the output schema to contain exactly one boolean field. The LLM evaluates each document against the prompt and returns a boolean judgment under that field; documents judged False are removed from the output, except in the build phase (is_build=True), where all documents are kept. The class is a lightweight 79-line subclass that delegates most of its logic to its parent.
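
The delegation pattern described above can be sketched with toy classes. This is a minimal illustration under stated assumptions, not DocETL's code: `ToyMap`, `ToyFilter`, the `judge` callable (a stand-in for the LLM call), and `cost_per_call` are all hypothetical names introduced here.

```python
from typing import Any, Callable


class ToyMap:
    """Hypothetical stand-in for a map-style operation (not DocETL's MapOperation)."""

    def __init__(self, judge: Callable[[dict], dict[str, Any]], cost_per_call: float = 0.0):
        self.judge = judge  # stub that replaces the LLM call
        self.cost_per_call = cost_per_call  # assumed flat cost per document

    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]:
        # Merge each document with the fields the judge produced for it.
        results = [{**doc, **self.judge(doc)} for doc in input_data]
        return results, self.cost_per_call * len(input_data)


class ToyFilter(ToyMap):
    """Reuses ToyMap.execute, then drops documents whose boolean field is False."""

    def __init__(self, judge: Callable[[dict], dict[str, Any]], filter_key: str,
                 cost_per_call: float = 0.0):
        super().__init__(judge, cost_per_call)
        self.filter_key = filter_key  # the single boolean field of the output schema

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]:
        results, cost = super().execute(input_data)
        if not is_build:
            # Outside the build phase, keep only documents judged True.
            results = [r for r in results if r[self.filter_key]]
        return results, cost


docs = [{"title": "Sea level rise"}, {"title": "Pasta recipes"}]
op = ToyFilter(lambda d: {"is_relevant": "Sea" in d["title"]}, "is_relevant", cost_per_call=0.5)
kept, cost = op.execute(docs)
kept_build, _ = op.execute(docs, is_build=True)
```

The subclass adds only the post-processing step; evaluation and cost accounting stay in the parent, which is why the real subclass can remain small.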

Usage

Use this operation when you need to filter documents based on complex semantic criteria that cannot be expressed as simple programmatic rules. Typical scenarios include removing irrelevant documents from a corpus, filtering out low-quality entries, selecting documents that match specific topical criteria, or enforcing content policies.

Code Reference

Source Location

Signature

class FilterOperation(MapOperation):
    class schema(MapOperation.schema):
        type: str = "filter"
        prompt: str
        output: dict[str, Any]

    def __init__(self, *args, **kwargs): ...

    def _limit_applies_to_inputs(self) -> bool: ...

    def _handle_result(self, result: dict[str, Any]) -> tuple[dict | None, bool]: ...

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...

Import

from docetl.operations.filter import FilterOperation

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Documents to filter |
| prompt | str | Yes | Jinja2 template prompt for LLM evaluation of each document |
| output | Dict | Yes | Output configuration with a schema containing exactly one boolean key |
| is_build | bool | No | Whether the operation is in build phase (if True, keeps all documents) |
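
The "exactly one boolean key" requirement on `output` can be checked along these lines. This is a sketch only: `find_filter_key` is a hypothetical helper, not a DocETL function, and it assumes the type is spelled `"bool"` as in the examples on this page.

```python
from typing import Any


def find_filter_key(output_config: dict[str, Any]) -> str:
    """Return the single boolean key of output_config["schema"], or raise.

    Hypothetical helper for illustration; the library's own validation may differ.
    """
    schema = output_config["schema"]
    if len(schema) != 1:
        raise ValueError(f"expected exactly one schema key, got {len(schema)}")
    key, type_name = next(iter(schema.items()))
    if type_name != "bool":  # assumption: only the "bool" spelling is accepted
        raise TypeError(f"schema key {key!r} must be bool, got {type_name!r}")
    return key


filter_key = find_filter_key({"schema": {"is_quality": "bool"}})
```

Validating the schema before any LLM call means a misconfigured filter fails fast rather than after incurring cost.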

Outputs

| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Filtered documents (only those where LLM returned True) and total cost |

Usage Examples

# YAML pipeline configuration for filtering
operations:
  - name: filter_relevant
    type: filter
    prompt: |
      Determine if this document is relevant to climate change research.
      Document: {{ input.title }} - {{ input.abstract }}
    output:
      schema:
        is_relevant: bool
    model: "gpt-4o-mini"

# Python API usage
from docetl.operations.filter import FilterOperation

config = {
    "name": "quality_filter",
    "type": "filter",
    "prompt": "Is this text well-written and coherent? Text: {{ input.content }}",
    "output": {
        "schema": {
            "is_quality": "bool",
        }
    },
}
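# runner, default_model, max_threads, and input_data are not defined in this
# snippet; they must be supplied by the surrounding pipeline code.
filter_op = FilterOperation(runner, config, default_model, max_threads)
# execute() returns the surviving documents and the total LLM cost
filtered_results, cost = filter_op.execute(input_data)
# Only documents where the LLM judged is_quality=True are returned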
