Implementation:Ucbepic Docetl ClusterOperation Execute

From Leeroopedia
Revision as of 16:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ucbepic_Docetl_ClusterOperation_Execute.md)


Knowledge Sources
Domains Data_Processing, Document_Clustering
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool for performing hierarchical clustering on documents with LLM-generated cluster summaries, provided by DocETL.

Description

The ClusterOperation class extends BaseOperation and groups documents into a hierarchical tree by running agglomerative clustering on document embeddings. Once the tree is built, it calls an LLM to generate a human-readable summary for each internal node from that node's children. Each leaf document is then annotated with its full cluster path, a tuple of its ancestors' summaries. Optionally, the tree can be collapsed to reduce granularity; the collapse setting is a quantile (0.0-1.0) used as the threshold for collapsing close tree nodes.
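The tree-building step can be illustrated with a small, dependency-free sketch. This is not DocETL's implementation: build_cluster_tree, linkage, cosine_distance, and the dict-based node layout are hypothetical names chosen for illustration, average linkage over cosine distance is an assumed choice, and the embeddings are hand-written vectors rather than model output.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def leaves(node):
    """Collect the leaf nodes under a tree node (a leaf has no "children")."""
    if "children" not in node:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def linkage(a, b):
    """Average pairwise distance between the leaves of two clusters (assumed linkage)."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(cosine_distance(x["embedding"], y["embedding"]) for x, y in pairs) / len(pairs)


def build_cluster_tree(docs, embeddings):
    """Repeatedly merge the two closest clusters until a single root remains."""
    clusters = [{"doc": d, "embedding": e} for d, e in zip(docs, embeddings)]
    if not clusters:
        raise ValueError("need at least one document")
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = {
            "children": [clusters[i], clusters[j]],
            "distance": linkage(clusters[i], clusters[j]),
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


# Toy data: two pet-related and two finance-related documents.
docs = [{"title": "cats"}, {"title": "kittens"}, {"title": "stocks"}, {"title": "bonds"}]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = build_cluster_tree(docs, embeddings)
groups = [sorted(leaf["doc"]["title"] for leaf in leaves(child)) for child in tree["children"]]
```

In this toy run the root has two children, one holding the pet documents and one holding the finance documents. The pairwise search is cubic or worse in the number of documents, so it only serves to show the shape of the resulting tree.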

Usage

Use this operation when you need to semantically group documents into meaningful categories with interpretable labels. Typical scenarios include organizing large document collections into topic hierarchies, grouping customer feedback by theme, creating taxonomies from unstructured text, or preparing data for stratified downstream operations like reduce or sample.

Code Reference

Source Location

Signature

class ClusterOperation(BaseOperation):
    def __init__(self, *args, **kwargs): ...

    def syntax_check(self) -> None: ...

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...

    def agglomerative_cluster_of_embeddings(self, input_data, embeddings): ...

    def get_tree_distances(self, t): ...

    def collapse_tree(self, tree, collapse=None): ...

    def annotate_clustering_tree(self, t) -> float: ...

    def annotate_leaves(self, tree, path=()): ...
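As a rough illustration of what annotate_clustering_tree and annotate_leaves are described as doing, the sketch below summarizes internal nodes bottom-up and then writes each leaf's ancestor path. It works on plain dicts rather than the class, the summarize callback is a hypothetical stand-in for the LLM call, the flat cost is made up, and the path ordering (nearest ancestor first) is an assumption.

```python
def annotate_tree(node, summarize):
    """Attach a summary to every internal node, children first; return total cost.

    `summarize` is a hypothetical stand-in for the LLM call: it receives the
    list of child nodes and returns (summary_dict, cost).
    """
    if "children" not in node:
        return 0.0
    total = sum(annotate_tree(child, summarize) for child in node["children"])
    summary, cost = summarize(node["children"])
    node.update(summary)
    return total + cost


def annotate_leaves(node, output_key="clusters", path=()):
    """Store on each leaf the tuple of its ancestors' summaries."""
    if "children" in node:
        summary = {k: v for k, v in node.items() if k != "children"}
        for child in node["children"]:
            # Assumed ordering: nearest ancestor first, root last.
            annotate_leaves(child, output_key, (summary,) + path)
    else:
        node[output_key] = path


def fake_summarize(children):
    """Toy summarizer: join child labels and charge a flat, made-up cost."""
    labels = [c.get("topic") or c["title"] for c in children]
    return {"topic": " + ".join(labels)}, 0.01


tree = {
    "children": [
        {"children": [{"title": "cats"}, {"title": "dogs"}]},
        {"title": "stocks"},
    ]
}
cost = annotate_tree(tree, fake_summarize)
annotate_leaves(tree)
cats = tree["children"][0]["children"][0]
stocks = tree["children"][1]
```

Summarizing children before parents lets a parent's prompt see its children's summaries instead of raw documents, which matches the description of summaries being generated for each internal node based on its children.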

Import

from docetl.operations.cluster import ClusterOperation

I/O Contract

Inputs

Name | Type | Required | Description
input_data | List[Dict] | Yes | Documents to cluster
embedding_keys | List[str] | Yes | Keys in documents to use for computing embeddings
summary_schema | Dict | Yes | Schema for cluster summary output from LLM
summary_prompt | str | Yes | Jinja2 template prompt for generating cluster summaries from children
output_key | str | No | Key to store cluster path on each document (default "clusters")
collapse | float | No | Quantile threshold for collapsing close tree nodes (0.0-1.0)
max_batch_size | int | No | Maximum concurrent threads for tree annotation
embedding_model | str | No | Model for computing document embeddings
model | str | No | LLM model for generating summaries
validate | List[str] | No | Validation rules for LLM outputs
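The required and optional keys above can be checked before running a pipeline. The following is a minimal sketch that mirrors the table only; check_cluster_config and REQUIRED_KEYS are hypothetical names, and the checks that DocETL's own syntax_check performs may differ.

```python
# Required config keys and their expected Python types, taken from the table above.
REQUIRED_KEYS = {
    "embedding_keys": list,
    "summary_schema": dict,
    "summary_prompt": str,
}


def check_cluster_config(config):
    """Raise on a malformed config; mirrors the Inputs table, not DocETL's own checks."""
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(config[key], expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
    if not all(isinstance(k, str) for k in config["embedding_keys"]):
        raise TypeError("embedding_keys must contain only strings")
    # collapse is optional, but when present it must be a quantile in [0.0, 1.0].
    collapse = config.get("collapse")
    if collapse is not None and not 0.0 <= float(collapse) <= 1.0:
        raise ValueError("collapse must be a quantile between 0.0 and 1.0")


config = {
    "embedding_keys": ["title", "abstract"],
    "summary_schema": {"topic": "string"},
    "summary_prompt": "Summarize: {{ inputs }}",
    "collapse": 0.5,
}
check_cluster_config(config)  # a valid config passes silently
```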

Outputs

Name | Type | Description
output | Tuple[List[Dict], float] | A (documents, cost) pair: the input documents, modified in place so that each carries its cluster path tuple under output_key, and the total cost

Usage Examples

# YAML pipeline configuration for clustering
operations:
  - name: cluster_articles
    type: cluster
    embedding_keys:
      - title
      - abstract
    summary_schema:
      topic: string
      description: string
    summary_prompt: |
      Given these clustered items: {{ inputs }}
      Provide a topic label and brief description for this cluster.
    output_key: clusters
    collapse: 0.5
    model: "gpt-4o-mini"

# Python API usage
from docetl.operations.cluster import ClusterOperation

config = {
    "name": "group_feedback",
    "type": "cluster",
    "embedding_keys": ["feedback_text"],
    "summary_schema": {"theme": "string"},
    "summary_prompt": "Summarize the common theme: {{ inputs }}",
    "output_key": "clusters",
}
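# runner, default_model, max_threads, and input_data are assumed to be
# defined by the surrounding pipeline code; they are not created here.
cluster_op = ClusterOperation(runner, config, default_model, max_threads)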
results, cost = cluster_op.execute(input_data)
# Each document now has a "clusters" key with its cluster path tuple
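Once documents carry a cluster path, a downstream step can bucket them by one level of that path, for example before a reduce or sample operation. The sketch below is illustrative only: group_by_cluster is a hypothetical helper, the results list is hand-written rather than produced by execute, and treating the last tuple element as the broadest cluster is an assumption about path ordering.

```python
from collections import defaultdict


def group_by_cluster(docs, output_key="clusters", field="theme", level=-1):
    """Bucket documents by one summary field taken from one level of their cluster path."""
    groups = defaultdict(list)
    for doc in docs:
        path = doc.get(output_key) or ()
        # Documents with an empty or missing path are grouped under None.
        label = path[level][field] if path else None
        groups[label].append(doc)
    return dict(groups)


# Hand-written stand-in for the documents returned by execute().
results = [
    {"feedback_text": "App crashes on login",
     "clusters": ({"theme": "login bugs"}, {"theme": "reliability"})},
    {"feedback_text": "Sync fails overnight",
     "clusters": ({"theme": "sync errors"}, {"theme": "reliability"})},
    {"feedback_text": "Too expensive",
     "clusters": ({"theme": "pricing"},)},
]
by_theme = group_by_cluster(results)
counts = {theme: len(items) for theme, items in by_theme.items()}
```

Changing level selects a finer or coarser grouping, which is one way to use the hierarchy for stratification.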
