Principle:Ucbepic Docetl Hierarchical Document Clustering

From Leeroopedia


Knowledge Sources
Domains LLM_Data_Processing, Document_Clustering
Last Updated 2026-02-08 00:00 GMT

Overview

Multi-level document clustering combines agglomerative clustering on embedding vectors with LLM-generated summaries at each tree node, producing a hierarchical grouping where every level carries human-readable descriptions.

Theoretical Basis

Flat clustering algorithms like KMeans assign each document to exactly one cluster at a single level of granularity, but real document collections often exhibit hierarchical structure: broad topics subdivide into subtopics, which further divide into specific themes. Agglomerative (bottom-up) hierarchical clustering captures this structure naturally. It starts with every document as its own cluster and progressively merges the two most similar items or groups until a single root cluster remains, producing a complete dendrogram.
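The merge process described above can be illustrated with a deliberately naive, pure-Python sketch. It is not DocETL's code and not scikit-learn's implementation: the function names (`agglomerate`, `average_linkage`), the choice of average linkage, and the toy 2-D "embeddings" are all assumptions made for illustration. The id convention (inputs are `0..n-1`, the cluster created by merge `i` gets id `n + i`) follows the one scikit-learn documents for its `children_` attribute.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, members_a, members_b):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(points[i], points[j]) for i in members_a for j in members_b]
    return sum(dists) / len(dists)


def agglomerate(points):
    """Naive bottom-up clustering (hypothetical helper, cubic-time or worse).

    Returns a list of merges (left_id, right_id, distance). Ids 0..n-1 are
    the input points; id n + i is the cluster created by merge i.
    """
    n = len(points)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member point indices
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_linkage(points, clusters[pair[0]], clusters[pair[1]]),
        )
        dist = average_linkage(points, clusters[a], clusters[b])
        # Replace the two clusters with their union under a fresh id.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, dist))
        next_id += 1
    return merges


# Toy 2-D "embeddings": two tight pairs and one outlier.
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(embeddings)
for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

For `n` inputs the loop always records `n - 1` merges, so the merge list encodes a full binary tree whose root is the last merge. A production system would use an optimized library routine rather than this quadratic pair search.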

DocETL's cluster operation builds on scikit-learn's AgglomerativeClustering applied to document embeddings. The resulting binary merge tree is then optionally collapsed using a distance-based heuristic: merges where the distance gap between parent and child is below a quantile-based threshold are flattened, producing wider but shallower trees. This collapse step converts the strict binary tree into a more interpretable multi-way tree where each internal node represents a meaningful grouping rather than an arbitrary binary split.
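One way to picture the collapse step is the sketch below. It is an assumption-laden illustration, not DocETL's actual rule: the helper names (`build_tree`, `internal_gaps`, `quantile`, `collapse_tree`), the dictionary node layout, the simple index-based quantile, and the strict `<` comparison are all hypothetical. The merge list is hard-coded in the same `(left_id, right_id, distance)` form a dendrogram would provide.

```python
def build_tree(n_leaves, merges):
    """Turn a merge list into nested dict nodes; leaves get distance 0."""
    nodes = {i: {"id": i, "distance": 0.0, "children": []} for i in range(n_leaves)}
    for k, (a, b, dist) in enumerate(merges):
        nodes[n_leaves + k] = {
            "id": n_leaves + k,
            "distance": dist,
            "children": [nodes[a], nodes[b]],
        }
    return nodes[n_leaves + len(merges) - 1]  # the last merge is the root


def internal_gaps(node):
    """Collect parent-minus-child merge distances for internal children."""
    gaps = []
    for child in node["children"]:
        if child["children"]:
            gaps.append(node["distance"] - child["distance"])
            gaps.extend(internal_gaps(child))
    return gaps


def quantile(values, q):
    """Simple index-based quantile (an assumption, not DocETL's formula)."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def _collapse(node, threshold):
    flattened = []
    for child in node["children"]:
        _collapse(child, threshold)  # handle deeper levels first
        gap = node["distance"] - child["distance"]
        if child["children"] and gap < threshold:
            flattened.extend(child["children"])  # splice grandchildren upward
        else:
            flattened.append(child)
    node["children"] = flattened


def collapse_tree(root, q):
    """Flatten merges whose distance gap falls below the q-quantile of gaps."""
    gaps = internal_gaps(root)
    if gaps:
        _collapse(root, quantile(gaps, q))
    return root


# Hypothetical merge list: two tight pairs (0,1) and (2,3), joined at
# distance 7.0, then joined with the outlier 4 at distance 8.0.
example_merges = [(0, 1, 1.0), (2, 3, 1.0), (5, 6, 7.0), (4, 7, 8.0)]

binary_root = collapse_tree(build_tree(5, example_merges), q=0.0)
wide_root = collapse_tree(build_tree(5, example_merges), q=0.5)
print([c["id"] for c in binary_root["children"]])
print([c["id"] for c in wide_root["children"]])
```

In this toy tree the merge at distance 7.0 sits only 1.0 below the root merge at 8.0, so with `q=0.5` it is folded into the root, which ends up with three children instead of two. With `q=0.0` no gap falls below the threshold and the tree stays binary.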

The distinguishing feature of DocETL's approach is LLM-powered annotation. After building the tree, the operation traverses it bottom-up, calling an LLM at each internal node with a summary prompt that receives the node's children (which may be individual documents or previously summarized sub-clusters). The LLM generates a structured summary according to a user-defined schema, and this summary is stored on the node. Finally, each leaf document is annotated with its full path of cluster summaries from root to leaf, giving every document rich hierarchical context. This combination of statistical clustering with LLM summarization produces groupings that are both data-driven and human-interpretable.
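The traversal order described above can be sketched as follows. No real LLM is called: `fake_summarize` is a hypothetical stand-in for the summary prompt, and the node layout, the `label` schema field, and the `clusters` key added to each document are assumptions chosen for illustration rather than DocETL's actual output format.

```python
def fake_summarize(inputs):
    """Stand-in for the LLM call: combine the labels or titles of the children."""
    names = [item.get("label") or item["title"] for item in inputs]
    return {"label": "group(" + ", ".join(names) + ")"}


def annotate(node, summarize):
    """Post-order traversal: every child is summarized before its parent."""
    if not node["children"]:
        return node["doc"]  # a leaf contributes its document as-is
    inputs = [annotate(child, summarize) for child in node["children"]]
    node["summary"] = summarize(inputs)
    return node["summary"]


def attach_paths(node, path=()):
    """Give each leaf document the root-to-leaf list of cluster summaries."""
    if not node["children"]:
        node["doc"]["clusters"] = list(path)
        return
    for child in node["children"]:
        attach_paths(child, path + (node["summary"],))


def leaf(title):
    return {"doc": {"title": title}, "children": []}


# Hypothetical tree: the root holds document A and a sub-cluster of B and C.
doc_a, doc_b, doc_c = leaf("A"), leaf("B"), leaf("C")
inner = {"children": [doc_b, doc_c]}
root = {"children": [doc_a, inner]}

annotate(root, fake_summarize)
attach_paths(root)

print(root["summary"])
print(doc_b["doc"]["clusters"])
```

Because `annotate` recurses before calling `summarize`, the summary of the inner cluster already exists when the root is summarized, so the parent prompt can consume child summaries instead of raw documents. Swapping `fake_summarize` for a function that calls a model and returns a dictionary matching the user-defined schema would leave the traversal unchanged.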

Key Design Decisions

Decision | Choice | Rationale
Clustering algorithm | Agglomerative clustering with full dendrogram | Produces a complete hierarchy without requiring a pre-specified number of clusters; distances are computed for all merges
Tree collapse | Distance-gap quantile threshold to flatten near-equal merges | Converts binary tree into readable multi-way tree; controlled by a single collapse parameter (0 to 1 quantile)
Node annotation | Bottom-up LLM summarization with user-defined schema | Each cluster gets a human-readable description; bottom-up traversal ensures child summaries are available before parent annotation
