Implementation:Ucbepic Docetl ClusterOperation Execute

Knowledge Sources	Ucbepic_Docetl DocETL Docs
Domains	Data_Processing, Document_Clustering
Last Updated	2026-02-08 00:00 GMT

Overview

ClusterOperation is DocETL's concrete tool for hierarchical clustering of documents, with an LLM-generated summary for each cluster.

Description

The ClusterOperation class extends BaseOperation to group documents into a hierarchical tree structure using agglomerative clustering on document embeddings. After building the cluster tree, it uses LLM calls to generate human-readable summaries for each internal node based on its children. Each leaf document is then annotated with its full cluster path as a tuple of ancestor summaries. The tree can optionally be collapsed based on a distance threshold to reduce granularity.
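The tree-building step described above can be illustrated with a small, dependency-free sketch. It is not DocETL's code: the function name `agglomerative_tree`, the `children`/`distance` node layout, and the use of single linkage with Euclidean distance are all assumptions chosen for brevity, and the embeddings are hand-written stand-ins.

```python
import math
from itertools import combinations


def agglomerative_tree(docs, embeddings):
    """Naive bottom-up clustering: repeatedly merge the two closest clusters.

    Assumes a non-empty input. Each working cluster is a pair of
    (tree node, member embedding vectors).
    """
    clusters = [(doc, [vec]) for doc, vec in zip(docs, embeddings)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage (an assumption): distance between closest members.
            d = min(math.dist(a, b) for a in clusters[i][1] for b in clusters[j][1])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merged = (
            {"children": [clusters[i][0], clusters[j][0]], "distance": d},
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


docs = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # toy vectors, not real embeddings
tree = agglomerative_tree(docs, embeddings)
```

In this toy run the two nearby documents merge first, and the outlier joins at the root. Internal nodes carry a merge distance, which is the kind of information a later collapse step can use.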

Usage

Use this operation when you need to semantically group documents into meaningful categories with interpretable labels. Typical scenarios include organizing large document collections into topic hierarchies, grouping customer feedback by theme, creating taxonomies from unstructured text, or preparing data for stratified downstream operations like reduce or sample.

Code Reference

Source Location

Repository: Ucbepic_Docetl
File: docetl/operations/cluster.py
Lines: 1-282

Signature

class ClusterOperation(BaseOperation):
    def __init__(self, *args, **kwargs): ...

    def syntax_check(self) -> None: ...

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...

    def agglomerative_cluster_of_embeddings(self, input_data, embeddings): ...

    def get_tree_distances(self, t): ...

    def collapse_tree(self, tree, collapse=None): ...

    def annotate_clustering_tree(self, t) -> float: ...

    def annotate_leaves(self, tree, path=()): ...
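
The path annotation performed by annotate_leaves can be pictured with the standalone sketch below. It is an illustration under assumptions rather than the library's code: the `children` key, the `summary` field, and the ordering of the path (nearest ancestor first) are all hypothetical choices.

```python
def annotate_leaves(node, output_key="clusters", path=()):
    """Write the tuple of ancestor summaries onto every leaf document."""
    if "children" in node:
        # Everything on an internal node except its children counts as its summary.
        summary = {k: v for k, v in node.items() if k != "children"}
        for child in node["children"]:
            # Assumption: the nearest ancestor comes first in the path.
            annotate_leaves(child, output_key, (summary,) + path)
    else:
        node[output_key] = path


def leaves(node):
    """Yield leaf documents in left-to-right order."""
    if "children" in node:
        for child in node["children"]:
            yield from leaves(child)
    else:
        yield node


# Toy tree whose internal nodes already carry (hand-written) summaries.
tree = {
    "summary": "all feedback",
    "children": [
        {
            "summary": "billing",
            "children": [{"text": "refund late"}, {"text": "double charge"}],
        },
        {"text": "app crashes"},
    ],
}
annotate_leaves(tree)
docs = list(leaves(tree))
```

A leaf two levels deep ends up with a two-element path, while a leaf directly under the root gets a one-element path.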

Import

from docetl.operations.cluster import ClusterOperation

I/O Contract

Inputs

Name	Type	Required	Description
input_data	List[Dict]	Yes	Documents to cluster
embedding_keys	List[str]	Yes	Keys in documents to use for computing embeddings
summary_schema	Dict	Yes	Schema for cluster summary output from LLM
summary_prompt	str	Yes	Jinja2 template prompt for generating cluster summaries from children
output_key	str	No	Key to store cluster path on each document (default "clusters")
collapse	float	No	Quantile (0.0-1.0) of tree distances used as the threshold for collapsing nodes that lie close together
max_batch_size	int	No	Maximum concurrent threads for tree annotation
embedding_model	str	No	Model for computing document embeddings
model	str	No	LLM model for generating summaries
validate	List[str]	No	Validation rules for LLM outputs
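
To make the collapse parameter more concrete, the sketch below shows one way a quantile can become a distance threshold for flattening a tree. It is a hedged illustration, not the DocETL implementation: the helper names, the `distance` field on internal nodes, and the rule "splice a node into its parent when the gap between their distances is below the threshold" are assumptions.

```python
def parent_child_gaps(node):
    """Collect the distance gap between each internal node and its internal children."""
    gaps = []
    for child in node.get("children", []):
        if "distance" in child:
            gaps.append(node["distance"] - child["distance"])
        gaps.extend(parent_child_gaps(child))
    return gaps


def collapse(node, threshold, parent_distance=None):
    """Return the list of nodes that replaces `node` after collapsing."""
    if "children" not in node:
        return [node]
    if parent_distance is not None and parent_distance - node["distance"] < threshold:
        # Too close to its parent: hand the children up to the parent.
        return [g for c in node["children"] for g in collapse(c, threshold, parent_distance)]
    kept = dict(node)  # shallow copy so the input tree is left untouched
    kept["children"] = [
        g for c in node["children"] for g in collapse(c, threshold, node["distance"])
    ]
    return [kept]


def collapse_by_quantile(tree, q):
    """Turn quantile `q` of the observed gaps into a threshold, then collapse."""
    gaps = sorted(parent_child_gaps(tree))
    if not gaps:
        return tree
    threshold = gaps[min(int(len(gaps) * q), len(gaps) - 1)]
    return collapse(tree, threshold)[0]


tree = {
    "distance": 4.0,
    "children": [
        {"distance": 3.5, "children": [{"text": "a"}, {"text": "b"}]},
        {"distance": 1.0, "children": [{"text": "c"}, {"text": "d"}]},
    ],
}
flatter = collapse_by_quantile(tree, 0.5)
unchanged = collapse_by_quantile(tree, 0.0)
```

With this toy tree, a quantile of 0.5 removes the node that merged at almost the same distance as the root and keeps the tighter subcluster, while a quantile of 0.0 leaves the structure as it was.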

Outputs

Name	Type	Description
output	Tuple[List[Dict], float]	The input documents, modified in place so that each carries its cluster path tuple, together with the total cost of the operation

Usage Examples

# YAML pipeline configuration for clustering
operations:
  - name: cluster_articles
    type: cluster
    embedding_keys:
      - title
      - abstract
    summary_schema:
      topic: string
      description: string
    summary_prompt: |
      Given these clustered items: {{ inputs }}
      Provide a topic label and brief description for this cluster.
    output_key: clusters
    collapse: 0.5
    model: "gpt-4o-mini"

# Python API usage
from docetl.operations.cluster import ClusterOperation

config = {
    "name": "group_feedback",
    "type": "cluster",
    "embedding_keys": ["feedback_text"],
    "summary_schema": {"theme": "string"},
    "summary_prompt": "Summarize the common theme: {{ inputs }}",
    "output_key": "clusters",
}
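# runner, default_model, max_threads, and input_data are assumed to be
# defined by the surrounding pipeline code.
cluster_op = ClusterOperation(runner, config, default_model, max_threads)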
results, cost = cluster_op.execute(input_data)
# Each document now has a "clusters" key with its cluster path tuple
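
Once documents carry a cluster path, a downstream step may want to bucket them by cluster label. The sketch below is one possible way to do that. The `results` list is a hand-written stand-in for real output, and the shape of each path entry (a dict with a "theme" field, following the summary_schema above) and the helper name `group_by_cluster` are assumptions.

```python
from collections import defaultdict


def group_by_cluster(results, output_key="clusters", level=0, label_field="theme"):
    """Bucket documents by one field of one entry in their cluster path."""
    groups = defaultdict(list)
    for doc in results:
        path = doc.get(output_key, ())
        # Documents whose path has no entry at `level` go into the None bucket.
        label = path[level].get(label_field) if level < len(path) else None
        groups[label].append(doc)
    return dict(groups)


# Hand-written stand-in for the `results` list returned by execute().
results = [
    {"feedback_text": "Refund took weeks", "clusters": ({"theme": "billing"},)},
    {"feedback_text": "Charged twice", "clusters": ({"theme": "billing"},)},
    {"feedback_text": "App crashes on login", "clusters": ({"theme": "stability"},)},
]

groups = group_by_cluster(results)
for theme, docs in groups.items():
    print(theme, len(docs))
```

Which position in the path corresponds to the broadest or the narrowest cluster depends on how the path is ordered, so it is worth inspecting one real document before choosing `level`.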

Related Pages

Principle:Ucbepic_Docetl_Hierarchical_Document_Clustering
