Implementation:Lm sys FastChat Topic Clustering
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Model_Evaluation |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
Clusters user prompts from FastChat conversations using text embeddings and multiple clustering algorithms, supporting both OpenAI and SentenceTransformer embedding models.
Description
Topic Clustering is a text clustering module that groups user prompts into topical clusters by embedding similarity. It implements an end-to-end pipeline: reading and filtering text data, generating embeddings, running a clustering algorithm, and extracting representative samples from each cluster. It supports three clustering algorithms (KMeans, AgglomerativeClustering, and HDBSCAN) and two embedding backends (OpenAI API embeddings and local SentenceTransformer models).
The pipeline begins with read_texts, which loads prompts from an input file and applies configurable filters for minimum and maximum text length as well as optional English-only filtering. The get_embeddings function then converts the filtered texts into dense vector representations using either the OpenAI embedding API or a locally loaded SentenceTransformer model, processing texts in configurable batch sizes for memory efficiency.
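The filtering step described above can be sketched as follows. This is a minimal illustration, not the module's code: `filter_prompts` is a hypothetical helper, the inclusive length bounds are an assumption, and the ASCII check is only a crude stand-in for whatever English detection `read_texts` actually performs.

```python
def filter_prompts(prompts, min_length=0, max_length=1000, english_only=False):
    """Keep prompts whose character length lies in [min_length, max_length].

    With english_only=True, prompts containing non-ASCII characters are
    dropped as well. The ASCII test is an assumption made for this sketch;
    it is not real language detection.
    """
    kept = []
    for prompt in prompts:
        text = prompt.strip()
        # Length filter (bounds treated as inclusive in this sketch).
        if not (min_length <= len(text) <= max_length):
            continue
        # Optional English-only filter, approximated by an ASCII check.
        if english_only and not text.isascii():
            continue
        kept.append(text)
    return kept
```

Filtering before embedding keeps very short or very long prompts from consuming embedding budget and from dominating clusters.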
The clustering stage offers three algorithms suited to different use cases. run_k_means provides fast, scalable clustering when the number of clusters is known in advance. run_agg_cluster uses hierarchical agglomerative clustering, which builds clusters bottom-up by repeatedly merging the closest groups and likewise takes the number of clusters as input. run_hdbscan_cluster performs density-based clustering that determines the number of clusters automatically and marks noise points with the label -1. After clustering, get_topk_indices finds the most representative examples in each cluster (those closest to the cluster center), and get_cluster_info aggregates per-cluster metadata such as size and representative samples.
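The idea behind selecting representative samples can be sketched without any dependencies. The helper below is hypothetical and assumes Euclidean distance to the center; the distance or similarity measure that `get_topk_indices` actually uses is not stated here and may differ (for example, cosine similarity).

```python
import math


def topk_nearest_to_centers(centers, labels, embeddings, topk):
    """Map each cluster ID to the indices of its topk members closest to the center.

    centers:    list of center vectors, indexed by cluster ID
    labels:     cluster ID assigned to each sample
    embeddings: one vector per sample, aligned with labels
    """
    result = {}
    for cluster_id, center in enumerate(centers):
        # Indices of all samples assigned to this cluster.
        members = [i for i, label in enumerate(labels) if label == cluster_id]
        # Closest first; Euclidean distance is an assumption of this sketch.
        members.sort(key=lambda i: math.dist(embeddings[i], center))
        result[cluster_id] = members[:topk]
    return result
```

The returned mapping has the same shape as the `topk_indices` value described in the I/O contract: cluster ID to a list of sample indices.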
Usage
Use this module to discover what topics users are asking about in the FastChat arena. It is typically run periodically on accumulated conversation data to identify trending topics, inform category design for leaderboards, and provide input to the summarize_cluster module for human-readable topic labels.
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/topic_clustering.py
- Lines: 1-292
Signature
import numpy as np


def read_texts(
    input_file: str,
    min_length: int = 0,
    max_length: int = 1000,
    english_only: bool = False
) -> list:
    """Read and filter text prompts from an input file."""

def get_embeddings(
    texts: list,
    model_name: str = "text-embedding-ada-002",
    batch_size: int = 128
) -> np.ndarray:
    """Generate embeddings for a list of texts using OpenAI or SentenceTransformer models."""

def run_k_means(embeddings: np.ndarray, num_clusters: int) -> tuple:
    """Run KMeans clustering on embeddings, returning labels and cluster centers."""

def run_agg_cluster(embeddings: np.ndarray, num_clusters: int) -> np.ndarray:
    """Run Agglomerative Clustering on embeddings, returning cluster labels."""

def run_hdbscan_cluster(embeddings: np.ndarray) -> np.ndarray:
    """Run HDBSCAN density-based clustering on embeddings, returning cluster labels."""

def get_topk_indices(
    centers: np.ndarray,
    labels: np.ndarray,
    embeddings: np.ndarray,
    topk: int
) -> dict:
    """Find the top-k most representative sample indices for each cluster."""

def print_topk(
    texts: list,
    labels: np.ndarray,
    topk_indices: dict,
    show_cut_off: int = 200
) -> None:
    """Print the top-k representative samples for each cluster to stdout."""

def get_cluster_info(
    texts: list,
    labels: np.ndarray,
    topk_indices: dict
) -> dict:
    """Aggregate cluster metadata including sizes and representative texts."""
Import
from fastchat.serve.monitor.topic_clustering import run_k_means, get_embeddings
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_file | str | Yes | Path to a file containing text prompts (one per line or JSONL) |
| min_length | int | No | Minimum text length to include (default: 0) |
| max_length | int | No | Maximum text length to include (default: 1000) |
| english_only | bool | No | Whether to filter for English-only texts (default: False) |
| texts | list[str] | Yes | List of text strings to embed (used by get_embeddings) |
| model_name | str | No | Embedding model identifier (default: "text-embedding-ada-002") |
| batch_size | int | No | Number of texts embedded per batch, for either backend (default: 128) |
| embeddings | np.ndarray | Yes | Embedding matrix of shape (n_samples, embedding_dim) (used by clustering functions) |
| num_clusters | int | Yes | Number of clusters for KMeans and Agglomerative (not used by HDBSCAN) |
| centers | np.ndarray | Yes | Cluster center vectors (used by get_topk_indices) |
| labels | np.ndarray | Yes | Cluster assignment labels per sample |
| topk | int | Yes | Number of representative samples per cluster (used by get_topk_indices) |
Outputs
| Name | Type | Description |
|---|---|---|
| texts | list[str] | read_texts returns filtered list of prompt strings |
| embeddings | np.ndarray | get_embeddings returns array of shape (n_texts, embedding_dim) |
| labels, centers | tuple[np.ndarray, np.ndarray] | run_k_means returns cluster labels and center vectors |
| labels | np.ndarray | run_agg_cluster and run_hdbscan_cluster return cluster label arrays |
| topk_indices | dict[int, list[int]] | get_topk_indices returns a mapping from cluster ID to list of sample indices |
| cluster_info | dict | get_cluster_info returns a dictionary with cluster sizes, representative texts, and metadata |
Supported Algorithms
| Algorithm | Function | Requires num_clusters | Notes |
|---|---|---|---|
| KMeans | run_k_means | Yes | Fast and scalable; returns both labels and cluster centers |
| Agglomerative | run_agg_cluster | Yes | Hierarchical (bottom-up) clustering; can capture non-spherical clusters, depending on the linkage used; returns labels only |
| HDBSCAN | run_hdbscan_cluster | No | Density-based; automatically determines cluster count; identifies noise points as label -1 |
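According to the signatures above, only run_k_means returns cluster centers, while get_topk_indices expects them. One way to bridge that gap for label-only algorithms is to compute each cluster's mean vector and skip HDBSCAN's noise label. The sketch below is an assumption about how a caller might do this; `centroids_from_labels` is a hypothetical helper, not part of the module.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN assigns this label to points it treats as noise


def centroids_from_labels(embeddings, labels):
    """Return {cluster_id: mean vector} for every non-noise cluster."""
    members = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        if label != NOISE_LABEL:  # noise points belong to no cluster
            members[label].append(vector)

    centroids = {}
    for label, vectors in members.items():
        dim = len(vectors[0])
        # Component-wise mean of all vectors in the cluster.
        centroids[label] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return centroids
```

Excluding label -1 matters when reporting cluster sizes as well: counting noise as a cluster would overstate the number of topics found.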
Supported Embedding Models
| Backend | Model Example | Notes |
|---|---|---|
| OpenAI API | text-embedding-ada-002 | Requires API key; high quality; remote API call |
| SentenceTransformer | all-MiniLM-L6-v2 | Local execution; no API key needed; configurable models |
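Whichever backend is chosen, the batching behavior controlled by batch_size can be illustrated with a backend-agnostic loop. In this sketch, `embed_in_batches` and `toy_backend` are hypothetical names; `toy_backend` stands in for a real OpenAI or SentenceTransformer call so the example runs without network access or model downloads.

```python
def embed_in_batches(texts, embed_batch, batch_size=128):
    """Embed texts batch by batch and return one vector per text, in input order.

    embed_batch is any callable that takes a list of strings and returns a
    list of vectors of the same length (a placeholder for a real backend).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))
    return vectors


def toy_backend(batch):
    """Placeholder backend: maps each text to (character count, word count)."""
    return [[float(len(t)), float(len(t.split()))] for t in batch]
```

A smaller batch_size lowers peak memory use and request size at the cost of more calls to the backend.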
Usage Examples
from fastchat.serve.monitor.topic_clustering import (
    read_texts,
    get_embeddings,
    run_k_means,
    get_topk_indices,
    print_topk,
    get_cluster_info,
)

# Step 1: Read and filter prompts
texts = read_texts(
    "conversations.jsonl",
    min_length=10,
    max_length=500,
    english_only=True,
)
print(f"Loaded {len(texts)} prompts after filtering")

# Step 2: Generate embeddings
embeddings = get_embeddings(
    texts,
    model_name="text-embedding-ada-002",
    batch_size=64,
)
print(f"Embedding shape: {embeddings.shape}")

# Step 3: Cluster with KMeans
num_clusters = 20
labels, centers = run_k_means(embeddings, num_clusters)

# Step 4: Find representative samples
topk_indices = get_topk_indices(centers, labels, embeddings, topk=5)

# Step 5: Display results
print_topk(texts, labels, topk_indices, show_cut_off=150)

# Step 6: Get structured cluster info for downstream use
cluster_info = get_cluster_info(texts, labels, topk_indices)
for cid, info in cluster_info.items():
    print(f"Cluster {cid}: {info['size']} conversations")
Related Pages
- Implements: Principle:Lm_sys_FastChat_Prompt_Topic_Clustering
- Lm_sys_FastChat_Summarize_Cluster - Generates human-readable topic labels from cluster output
- Lm_sys_FastChat_Criteria_Labeling - Complementary labeling approach using predefined criteria
- Lm_sys_FastChat_Deduplication - Deduplication that benefits from cluster-aware processing
- Lm_sys_FastChat_Monitor_Markdown - Renders category-specific leaderboards informed by topic clusters