Implementation:Ucbepic Docetl ClusterOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Document_Clustering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for performing hierarchical clustering on documents with LLM-generated cluster summaries, provided by DocETL.
Description
The ClusterOperation class extends BaseOperation to group documents into a hierarchical tree structure using agglomerative clustering on document embeddings. After building the cluster tree, it uses LLM calls to generate human-readable summaries for each internal node based on its children. Each leaf document is then annotated with its full cluster path as a tuple of ancestor summaries. The tree can optionally be collapsed based on a distance threshold to reduce granularity.
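The tree-building step can be illustrated with a small pure-Python sketch. It is not DocETL's implementation: the linkage rule (average linkage), the metric (cosine distance), the `embedding` and `distance` field names, and the hand-written two-dimensional vectors are all assumptions made for illustration. Real embeddings would come from the configured embedding model.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def leaves(node):
    """Return all leaf documents under a node (a leaf has no 'children')."""
    if "children" not in node:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

def average_linkage(a, b):
    """Mean pairwise distance between the leaves of two clusters."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    total = sum(cosine_distance(x["embedding"], y["embedding"]) for x, y in pairs)
    return total / len(pairs)

def build_tree(docs):
    """Greedily merge the two closest clusters until one root remains.

    Expects a non-empty list of documents.
    """
    nodes = list(docs)
    while len(nodes) > 1:
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: average_linkage(nodes[ij[0]], nodes[ij[1]]),
        )
        merged = {
            "children": [nodes[i], nodes[j]],
            "distance": average_linkage(nodes[i], nodes[j]),  # merge height
        }
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

# Hypothetical documents with stand-in embeddings.
docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.9, 0.1]},
    {"id": "c", "embedding": [0.0, 1.0]},
]
root = build_tree(docs)
```

In this toy input, documents `a` and `b` point in nearly the same direction, so they are merged first and `c` joins only at the root. The naive pairwise search is fine for a demonstration but scales poorly with collection size.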
Usage
Use this operation when you need to semantically group documents into meaningful categories with interpretable labels. Typical scenarios include organizing large document collections into topic hierarchies, grouping customer feedback by theme, creating taxonomies from unstructured text, or preparing data for stratified downstream operations like reduce or sample.
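As a rough illustration of the stratification scenario, the sketch below groups already-annotated documents by one level of their cluster path. The helper `group_by_cluster`, the `topic` label field, the root-first path ordering, and the sample documents are all hypothetical; the actual shape of each path entry depends on the configured summary_schema.

```python
from collections import defaultdict

def group_by_cluster(docs, output_key="clusters", level=0, label_field="topic"):
    """Bucket documents by the label found at one depth of their cluster path.

    Documents whose path is shorter than `level` are grouped under None.
    """
    groups = defaultdict(list)
    for doc in docs:
        path = doc.get(output_key, ())
        label = path[level][label_field] if len(path) > level else None
        groups[label].append(doc)
    return dict(groups)

# Hypothetical output of a cluster operation (paths assumed to be root first).
docs = [
    {"text": "Charged twice", "clusters": ({"topic": "feedback"}, {"topic": "billing"})},
    {"text": "Refund was slow", "clusters": ({"topic": "feedback"}, {"topic": "billing"})},
    {"text": "App crashes on login", "clusters": ({"topic": "feedback"}, {"topic": "stability"})},
]
groups = group_by_cluster(docs, level=1)
```

Each bucket can then be sampled or reduced independently, which is one way to keep small clusters from being drowned out by large ones.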
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/cluster.py
- Lines: 1-282
Signature
class ClusterOperation(BaseOperation):
    def __init__(self, *args, **kwargs): ...
    def syntax_check(self) -> None: ...
    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...
    def agglomerative_cluster_of_embeddings(self, input_data, embeddings): ...
    def get_tree_distances(self, t): ...
    def collapse_tree(self, tree, collapse=None): ...
    def annotate_clustering_tree(self, t) -> float: ...
    def annotate_leaves(self, tree, path=()): ...
Import
from docetl.operations.cluster import ClusterOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Documents to cluster |
| embedding_keys | List[str] | Yes | Keys in documents to use for computing embeddings |
| summary_schema | Dict | Yes | Schema for cluster summary output from LLM |
| summary_prompt | str | Yes | Jinja2 template prompt for generating cluster summaries from children |
| output_key | str | No | Key to store cluster path on each document (default "clusters") |
| collapse | float | No | Quantile threshold for collapsing close tree nodes (0.0-1.0) |
| max_batch_size | int | No | Maximum concurrent threads for tree annotation |
| embedding_model | str | No | Model for computing document embeddings |
| model | str | No | LLM model for generating summaries |
| validate | List[str] | No | Validation rules for LLM outputs |
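
One plausible reading of the collapse parameter is sketched below: the quantile is taken over the gaps between each parent's merge distance and its internal children's merge distances, and any child whose gap falls below the resulting threshold is spliced into its parent. This is an assumption about the mechanism, not a transcription of DocETL's code; the `distance` and `children` field names and the helper functions are hypothetical.

```python
def tree_gaps(node):
    """Collect parent-minus-child merge-distance gaps for internal children."""
    gaps = []
    for child in node.get("children", []):
        if "children" in child:
            gaps.append(node["distance"] - child["distance"])
            gaps.extend(tree_gaps(child))
    return gaps

def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def collapse_tree(node, threshold):
    """Return a new tree in which internal children closer to their parent
    than `threshold` are replaced by their own children. Leaves are shared
    with the input tree; internal nodes are shallow copies."""
    if "children" not in node:
        return node
    merged = []
    for child in node["children"]:
        child = collapse_tree(child, threshold)
        if "children" in child and node["distance"] - child["distance"] < threshold:
            merged.extend(child["children"])  # splice grandchildren into this node
        else:
            merged.append(child)
    return {**node, "children": merged}

# Hypothetical tree: the first subtree merges almost as late as the root,
# the second subtree is a tight cluster well below it.
tree = {
    "distance": 1.0,
    "children": [
        {"distance": 0.9, "children": [{"id": "a1"}, {"id": "a2"}]},
        {"distance": 0.2, "children": [{"id": "b1"}, {"id": "b2"}]},
    ],
}
threshold = quantile(tree_gaps(tree), 0.5)
collapsed = collapse_tree(tree, threshold)
```

With a quantile of 0.5, the loose first subtree is folded into the root while the tight second subtree survives as its own cluster. Under this reading, higher collapse values flatten more of the hierarchy.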
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | The input documents, modified in place so that each carries its cluster path tuple under output_key, together with the total cost of the operation |
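
To make the shape of the cluster path concrete, here is a minimal sketch of leaf annotation over a toy tree. The `summary` and `children` field names, the root-first ordering of the path, and the sample tree are assumptions for illustration; DocETL's own node layout and ordering may differ.

```python
def annotate_leaves(node, output_key="clusters", path=()):
    """Attach the tuple of ancestor summaries to every leaf under `node`."""
    children = node.get("children")
    if children:
        # Internal node: extend the path with this node's summary and recurse.
        for child in children:
            annotate_leaves(child, output_key, path + (node["summary"],))
    else:
        # Leaf document: record the accumulated path in place.
        node[output_key] = path

# Hypothetical tree whose internal nodes already carry LLM-style summaries.
tree = {
    "summary": {"topic": "feedback"},
    "children": [
        {
            "summary": {"topic": "billing"},
            "children": [{"text": "Refund took two weeks"}, {"text": "Charged twice"}],
        },
        {"text": "App crashes on login"},
    ],
}
annotate_leaves(tree)
deep_leaf = tree["children"][0]["children"][0]
shallow_leaf = tree["children"][1]
```

Leaves at different depths end up with paths of different lengths, so downstream code should not assume a fixed number of levels.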
Usage Examples
# YAML pipeline configuration for clustering
operations:
  - name: cluster_articles
    type: cluster
    embedding_keys:
      - title
      - abstract
    summary_schema:
      topic: string
      description: string
    summary_prompt: |
      Given these clustered items: {{ inputs }}
      Provide a topic label and brief description for this cluster.
    output_key: clusters
    collapse: 0.5
    model: "gpt-4o-mini"
# Python API usage
from docetl.operations.cluster import ClusterOperation

config = {
    "name": "group_feedback",
    "type": "cluster",
    "embedding_keys": ["feedback_text"],
    "summary_schema": {"theme": "string"},
    "summary_prompt": "Summarize the common theme: {{ inputs }}",
    "output_key": "clusters",
}

# runner, default_model, max_threads, and input_data are not defined here;
# they are assumed to be supplied by the surrounding pipeline code.
cluster_op = ClusterOperation(runner, config, default_model, max_threads)
results, cost = cluster_op.execute(input_data)
# Each document now has a "clusters" key with its cluster path tuple