Principle:Lm sys FastChat Prompt Topic Clustering

Field	Value
Page Type	Principle
Title	Prompt Topic Clustering
Repository	lm-sys/FastChat
Workflow	Arena_Data_Analysis
Domains	NLP, Clustering
Knowledge Sources	fastchat/serve/monitor/topic_clustering.py, fastchat/serve/monitor/summarize_cluster.py
Last Updated	2026-02-07 14:00 GMT

Overview

This principle describes how to discover latent topic structure in Arena prompts using embedding-based clustering and LLM-powered summarization. Grouping semantically similar prompts into clusters and generating a human-readable label for each cluster enables categorical analysis of Arena usage: what users ask about, how topic distributions shift over time, and whether certain models are preferred for specific topic categories.

Description

Sentence Embedding Computation

The first stage of topic clustering transforms each prompt from variable-length text into a fixed-dimensional vector representation using a sentence embedding model (e.g., via the sentence-transformers library). Models such as all-MiniLM-L6-v2 or all-mpnet-base-v2 encode semantic meaning into dense vectors, where prompts with similar meaning occupy nearby regions of the embedding space. Embedding computation is performed in batches for efficiency, and the resulting vectors are stored as a matrix of shape (num_prompts, embedding_dim).
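The batching step described above can be sketched as follows. This is a minimal illustration and is not taken from topic_clustering.py: the names embed_prompts, BatchEncoder, and sentence_transformer_encoder are hypothetical, and the encoder is passed in as a callable so the batching logic can be exercised without downloading a model.

```python
from typing import Callable, List, Sequence

import numpy as np

# Hypothetical type alias: a batch encoder maps a list of strings
# to an array of shape (batch_size, embedding_dim).
BatchEncoder = Callable[[List[str]], np.ndarray]


def embed_prompts(
    prompts: Sequence[str],
    encode_batch: BatchEncoder,
    batch_size: int = 256,
) -> np.ndarray:
    """Encode prompts batch by batch and stack the results into a
    matrix of shape (num_prompts, embedding_dim)."""
    if len(prompts) == 0:
        raise ValueError("prompts must not be empty")
    chunks = []
    for start in range(0, len(prompts), batch_size):
        batch = list(prompts[start:start + batch_size])
        vectors = np.asarray(encode_batch(batch), dtype=np.float32)
        if vectors.shape[0] != len(batch):
            raise ValueError("encoder must return one vector per prompt")
        chunks.append(vectors)
    return np.vstack(chunks)


def sentence_transformer_encoder(model_name: str = "all-MiniLM-L6-v2") -> BatchEncoder:
    """Build an encoder backed by the sentence-transformers library.
    The import is deferred so the rest of the sketch runs without it."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda batch: model.encode(batch, convert_to_numpy=True)
```

Calling embed_prompts(prompts, sentence_transformer_encoder()) would yield the embedding matrix. In practice SentenceTransformer.encode already accepts a batch_size argument, so the explicit loop mainly matters when swapping in a different embedding backend.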

Dimensionality Reduction

High-dimensional embedding vectors (typically 384 to 768 dimensions) are projected into a lower-dimensional space before clustering. Dimensionality reduction serves two purposes: it reduces the computational cost of clustering, and it mitigates the curse of dimensionality, where pairwise distances become less discriminative as the number of dimensions grows. Common techniques include UMAP (Uniform Manifold Approximation and Projection), a nonlinear method that preserves local neighborhood structure, and PCA (Principal Component Analysis), a linear projection that retains the directions of greatest variance. UMAP is often preferred as a preprocessing step for clustering because it tends to keep neighboring points together, although distances and densities in the reduced space are not a faithful copy of those in the original space.
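As a rough illustration of the projection step, the sketch below implements plain PCA with NumPy and offers UMAP behind the same interface. The function names are hypothetical, the UMAP variant assumes the optional umap-learn package is installed, and nothing here is taken from the FastChat scripts.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows onto the top principal components (PCA via SVD)."""
    embeddings = np.asarray(embeddings, dtype=float)
    if not 1 <= n_components <= min(embeddings.shape):
        raise ValueError("n_components is out of range for this matrix")
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def umap_reduce(
    embeddings: np.ndarray,
    n_components: int,
    n_neighbors: int = 15,
    random_state: int = 0,
) -> np.ndarray:
    """Nonlinear alternative using umap-learn (imported lazily)."""
    import umap

    reducer = umap.UMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
        metric="cosine",
        random_state=random_state,
    )
    return reducer.fit_transform(embeddings)
```

PCA is deterministic and cheap, which makes it a convenient baseline; the parameter values shown for UMAP are illustrative defaults rather than tuned settings.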

KMeans Clustering

KMeans partitions the reduced-dimensional embeddings into a predetermined number of clusters (k) by iteratively assigning each point to the nearest centroid and updating each centroid to the mean of its assigned points. KMeans is computationally efficient and tends to produce compact clusters of similar spatial extent, though not necessarily of equal membership. The number of clusters k is typically chosen via the elbow method or silhouette analysis, or set heuristically based on the desired granularity of topic categories. Because its objective implicitly favors roughly spherical clusters of similar variance, KMeans may not capture the true topic structure exactly, but it provides a useful initial partitioning.
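The assign-and-update loop described above (Lloyd's algorithm) can be sketched in a few lines of NumPy. This is an illustrative version, not the FastChat implementation; the function name is hypothetical, and the farthest-first initialization is a simple stand-in for k-means++ chosen here to keep the example short.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, n_iters: int = 100, seed: int = 0):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)

    # Farthest-first initialization (assumption of this sketch): start from
    # a random point, then repeatedly add the point farthest from all
    # centroids chosen so far.
    chosen = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [((points - c) ** 2).sum(axis=1) for c in chosen], axis=0
        )
        chosen.append(points[dist_to_nearest.argmax()])
    centroids = np.array(chosen, dtype=float)

    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For real workloads, scikit-learn's sklearn.cluster.KMeans (with n_clusters, n_init, and random_state) is the more practical choice; the loop above is only meant to make the two alternating steps concrete.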

HDBSCAN Clustering

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) offers an alternative that does not require specifying the number of clusters a priori. It discovers clusters of varying density and designates low-density points as noise (outliers). This is particularly useful for Arena prompts, where some topics form tight clusters (e.g., common programming questions) while others are diffuse (e.g., creative writing prompts with diverse themes). HDBSCAN's ability to identify noise points also helps isolate unusual or adversarial prompts that do not belong to any coherent topic.
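The sketch below shows one way the noise designation might be handled downstream. It assumes scikit-learn 1.3 or later, whose sklearn.cluster.HDBSCAN marks outliers with the label -1; the helper names and the min_cluster_size value are illustrative and are not taken from topic_clustering.py.

```python
from typing import Dict, List, Sequence, Tuple

NOISE_LABEL = -1  # label scikit-learn's HDBSCAN assigns to outliers


def hdbscan_labels(embeddings, min_cluster_size: int = 10) -> List[int]:
    """Cluster with scikit-learn's HDBSCAN (imported lazily).
    No cluster count is supplied; min_cluster_size is the main knob."""
    from sklearn.cluster import HDBSCAN

    model = HDBSCAN(min_cluster_size=min_cluster_size)
    return model.fit_predict(embeddings).tolist()


def split_clusters_and_noise(
    prompts: Sequence[str], labels: Sequence[int]
) -> Tuple[Dict[int, List[str]], List[str]]:
    """Group prompts by cluster id and set noise points aside."""
    if len(prompts) != len(labels):
        raise ValueError("prompts and labels must have the same length")
    clusters: Dict[int, List[str]] = {}
    noise: List[str] = []
    for prompt, label in zip(prompts, labels):
        if label == NOISE_LABEL:
            noise.append(prompt)
        else:
            clusters.setdefault(int(label), []).append(prompt)
    return clusters, noise
```

Keeping the noise prompts in a separate list makes it easy to inspect them for unusual or adversarial inputs instead of forcing them into the nearest topic.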

Cluster Label Generation via LLM Summarization

Once clusters are formed, their content must be summarized into human-interpretable labels. This is achieved by sampling representative prompts from each cluster and submitting them to a large language model with instructions to generate a concise topic label and description. The LLM-based approach can produce more nuanced labels than keyword-based methods (e.g., TF-IDF top terms), especially for clusters defined by semantic similarity rather than lexical overlap. The generated labels can then be reviewed by a human and refined where needed to form the final topic taxonomy.
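A minimal sketch of the sample-then-summarize step follows. The instruction wording, the function names, and the use of uniform random sampling to pick "representative" prompts are all assumptions of this example (sampling prompts nearest the centroid would be another option), and the model call is injected as a callable so no particular LLM API is implied.

```python
import random
from typing import Callable, Dict, List, Sequence


def build_summary_request(samples: Sequence[str], max_chars: int = 300) -> str:
    """Format sampled prompts into a single labeling instruction."""
    lines = [f"{i + 1}. {text[:max_chars]}" for i, text in enumerate(samples)]
    return (
        "The following user prompts belong to one cluster.\n"
        + "\n".join(lines)
        + "\n\nReply with a topic label of at most five words, "
        "followed by a one-sentence description."
    )


def label_clusters(
    clusters: Dict[int, List[str]],
    ask_llm: Callable[[str], str],
    samples_per_cluster: int = 5,
    seed: int = 0,
) -> Dict[int, str]:
    """Sample prompts from each cluster and ask the model for a label."""
    rng = random.Random(seed)
    labels: Dict[int, str] = {}
    for cluster_id, prompts in sorted(clusters.items()):
        sample_size = min(samples_per_cluster, len(prompts))
        samples = rng.sample(prompts, sample_size)
        labels[cluster_id] = ask_llm(build_summary_request(samples)).strip()
    return labels
```

Truncating each sampled prompt keeps the request short when clusters contain very long inputs, and fixing the seed makes the sampled set reproducible for later review.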

Theoretical Basis

Clustering in embedding space groups semantically similar prompts by relying on the property that pretrained sentence encoders map text with similar meaning to nearby vectors. The justification traces back to the distributional hypothesis, which holds that words (and, by extension, sentences) occurring in similar contexts have similar meanings, and which motivates most embedding-based NLP methods.

KMeans minimizes the within-cluster sum of squared distances to centroids, an objective that implicitly favors roughly spherical clusters of similar variance. This is a limitation when true topic clusters vary in size and shape. In exchange, KMeans is computationally efficient and always converges in a finite number of iterations, although only to a local optimum that depends on the centroid initialization; results are reproducible only when the random seed is fixed.

HDBSCAN relaxes these assumptions by defining clusters as connected regions of high density, discovering clusters of varying density and shape while naturally handling outliers. Its foundation is density-based hierarchical clustering: it builds a hierarchy of density-connected components and keeps the clusters that persist across the widest range of density thresholds, a notion of persistence closely related to ideas from topological data analysis.

LLM-generated cluster summaries provide human-interpretable labels for the discovered topics, bridging the gap between mathematical cluster assignments and actionable categorical analysis. The approach relies on the in-context learning capability of large language models to summarize a handful of sampled prompts, so Arena usage patterns can be categorized without manual annotation.
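For reference, the within-cluster sum of squares that KMeans minimizes can be written out explicitly. The notation is introduced here for illustration: x_i are the reduced embedding vectors, C_j is the set of points assigned to cluster j, and mu_j is that cluster's centroid.

```latex
J(C_1,\dots,C_k) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

The assignment step lowers J by moving each point to its nearest centroid, and the update step lowers J by resetting each centroid to the mean of its members, which is why the procedure cannot increase the objective from one iteration to the next.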

Related Pages

Implemented by: Implementation:Lm_sys_FastChat_Topic_Clustering
Implemented by: Implementation:Lm_sys_FastChat_Summarize_Cluster
