Implementation:Lm sys FastChat Topic Clustering

Knowledge Sources	Lm_sys_FastChat
Domains	Data_Processing, Model_Evaluation
Last Updated	2026-02-07 06:00 GMT

Overview

Clusters user prompts from FastChat conversations using text embeddings and multiple clustering algorithms, supporting both OpenAI and SentenceTransformer embedding models.

Description

Topic Clustering groups user prompts into topical clusters by embedding similarity. The module implements an end-to-end pipeline: reading and filtering text data, generating embeddings, running a clustering algorithm, and extracting representative samples from each cluster. It supports three clustering algorithms (KMeans, AgglomerativeClustering, and HDBSCAN) and two embedding backends (OpenAI API embeddings and local SentenceTransformer models).

The pipeline begins with read_texts, which loads prompts from an input file and applies configurable filters for minimum and maximum text length, plus optional English-only filtering. The get_embeddings function then converts the filtered texts into dense vectors using either the OpenAI embedding API or a locally loaded SentenceTransformer model, processing the texts in batches of a configurable size to limit memory use.
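The length filtering and batching described above can be sketched as follows. This is a minimal illustration and not the module's own code: filter_by_length and iter_batches are hypothetical helper names, and measuring length in characters is an assumption.

```python
from typing import Iterator, List


def filter_by_length(
    texts: List[str], min_length: int = 0, max_length: int = 1000
) -> List[str]:
    """Keep texts whose character count lies within [min_length, max_length]."""
    kept = []
    for text in texts:
        text = text.strip()
        if min_length <= len(text) <= max_length:
            kept.append(text)
    return kept


def iter_batches(texts: List[str], batch_size: int = 128) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Assumed sample data: one prompt too short, one in range, one too long.
prompts = ["hi", "Explain gradient descent in simple terms.", "x" * 2000]
filtered = filter_by_length(prompts, min_length=10, max_length=500)

# Each batch would be passed to the embedding backend in turn.
batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
```

Batching this way keeps each embedding call bounded in size regardless of how many prompts survive filtering.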

The clustering stage offers three algorithms suited to different use cases. run_k_means provides fast, scalable clustering when the number of clusters is known in advance. run_agg_cluster applies hierarchical agglomerative clustering, which builds clusters by successively merging the closest groups. run_hdbscan_cluster performs density-based clustering that determines the number of clusters automatically and marks outliers as noise points. After clustering, get_topk_indices finds the most representative examples in each cluster (those closest to the cluster center), and get_cluster_info aggregates per-cluster metadata such as size and representative samples.
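The idea of picking the samples closest to each cluster center can be sketched with NumPy. This is an illustration under stated assumptions, not the module's implementation: topk_nearest_to_center is a hypothetical name, and Euclidean distance is assumed here, whereas the module may rank by a different similarity measure.

```python
import numpy as np


def topk_nearest_to_center(centers, labels, embeddings, topk):
    """Map each cluster ID to the indices of its topk members nearest the center."""
    result = {}
    for cluster_id, center in enumerate(centers):
        # Positions (in the full dataset) of the samples assigned to this cluster.
        member_idx = np.flatnonzero(labels == cluster_id)
        if member_idx.size == 0:
            result[cluster_id] = []
            continue
        # Euclidean distance from each member to the center (an assumption).
        dists = np.linalg.norm(embeddings[member_idx] - center, axis=1)
        order = np.argsort(dists)[:topk]
        result[cluster_id] = member_idx[order].tolist()
    return result


# Assumed toy data: five 2-D embeddings in two clusters.
embeddings = np.array(
    [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.5, 0.0]]
)
labels = np.array([0, 0, 1, 1, 0])
centers = np.array([[0.2, 0.0], [1.0, 1.0]])

top = topk_nearest_to_center(centers, labels, embeddings, topk=2)
```

The returned indices refer to positions in the original texts list, so they can be used directly to look up the representative prompts.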

Usage

Use this module to discover what topics users are asking about in the FastChat arena. It is typically run periodically on accumulated conversation data to identify trending topics, inform category design for leaderboards, and provide input to the summarize_cluster module for human-readable topic labels.

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/monitor/topic_clustering.py
Lines: 1-292

Signature

import numpy as np

def read_texts(
    input_file: str,
    min_length: int = 0,
    max_length: int = 1000,
    english_only: bool = False
) -> list:
    """Read and filter text prompts from an input file."""

def get_embeddings(
    texts: list,
    model_name: str = "text-embedding-ada-002",
    batch_size: int = 128
) -> np.ndarray:
    """Generate embeddings for a list of texts using OpenAI or SentenceTransformer models."""

def run_k_means(embeddings: np.ndarray, num_clusters: int) -> tuple:
    """Run KMeans clustering on embeddings, returning labels and cluster centers."""

def run_agg_cluster(embeddings: np.ndarray, num_clusters: int) -> np.ndarray:
    """Run Agglomerative Clustering on embeddings, returning cluster labels."""

def run_hdbscan_cluster(embeddings: np.ndarray) -> np.ndarray:
    """Run HDBSCAN density-based clustering on embeddings, returning cluster labels."""

def get_topk_indices(
    centers: np.ndarray,
    labels: np.ndarray,
    embeddings: np.ndarray,
    topk: int
) -> dict:
    """Find the top-k most representative sample indices for each cluster."""

def print_topk(
    texts: list,
    labels: np.ndarray,
    topk_indices: dict,
    show_cut_off: int = 200
) -> None:
    """Print the top-k representative samples for each cluster to stdout."""

def get_cluster_info(
    texts: list,
    labels: np.ndarray,
    topk_indices: dict
) -> dict:
    """Aggregate cluster metadata including sizes and representative texts."""

Import

from fastchat.serve.monitor.topic_clustering import run_k_means, get_embeddings

I/O Contract

Inputs

Name	Type	Required	Description
input_file	str	Yes	Path to a file containing text prompts (one per line or JSONL)
min_length	int	No	Minimum text length to include (default: 0)
max_length	int	No	Maximum text length to include (default: 1000)
english_only	bool	No	Whether to filter for English-only texts (default: False)
texts	list[str]	Yes	List of text strings to embed (used by get_embeddings)
model_name	str	No	Embedding model identifier (default: "text-embedding-ada-002")
batch_size	int	No	Number of texts per embedding batch (default: 128)
embeddings	np.ndarray	Yes	Embedding matrix of shape (n_samples, embedding_dim) (used by clustering functions)
num_clusters	int	Yes	Number of clusters for KMeans and Agglomerative (not used by HDBSCAN)
centers	np.ndarray	Yes	Cluster center vectors (used by get_topk_indices)
labels	np.ndarray	Yes	Cluster assignment labels per sample
topk	int	Yes	Number of representative samples per cluster (used by get_topk_indices)

Outputs

Name	Type	Description
texts	list[str]	read_texts returns filtered list of prompt strings
embeddings	np.ndarray	get_embeddings returns array of shape (n_texts, embedding_dim)
labels, centers	tuple[np.ndarray, np.ndarray]	run_k_means returns cluster labels and center vectors
labels	np.ndarray	run_agg_cluster and run_hdbscan_cluster return cluster label arrays
topk_indices	dict[int, list[int]]	get_topk_indices returns a mapping from cluster ID to list of sample indices
cluster_info	dict	get_cluster_info returns a dictionary with cluster sizes, representative texts, and metadata
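According to the tables above, only run_k_means returns center vectors, while get_topk_indices expects a centers argument. When labels come from an algorithm that returns labels only, one option is to use the mean embedding of each cluster as its center. The sketch below shows that approach; centroids_from_labels is a hypothetical helper that is not part of the module, and skipping negative labels as noise is an assumption.

```python
import numpy as np


def centroids_from_labels(embeddings, labels):
    """Return one mean-embedding row per non-noise cluster, in sorted label order."""
    cluster_ids = sorted(int(c) for c in np.unique(labels) if c >= 0)
    return np.stack([embeddings[labels == c].mean(axis=0) for c in cluster_ids])


# Assumed toy data: two clusters plus one noise point.
embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, -1])  # -1 marks a noise point and is ignored

centers = centroids_from_labels(embeddings, labels)
```

When cluster labels are contiguous integers starting at 0, row i of the result corresponds to cluster i.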

Supported Algorithms

Algorithm	Function	Requires num_clusters	Notes
KMeans	run_k_means	Yes	Fast and scalable; returns both labels and cluster centers
Agglomerative	run_agg_cluster	Yes	Hierarchical clustering; can capture non-spherical clusters depending on the linkage used; returns labels only
HDBSCAN	run_hdbscan_cluster	No	Density-based; automatically determines cluster count; identifies noise points as label -1
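Because HDBSCAN marks noise points with the label -1, downstream code usually needs to treat that label separately from real clusters. A minimal sketch of counting cluster sizes while setting noise aside follows; cluster_sizes is a hypothetical helper, not a function of the module.

```python
from collections import Counter


def cluster_sizes(labels, noise_label=-1):
    """Count members per cluster, reporting noise points separately."""
    counts = Counter(int(label) for label in labels)
    # Remove the noise bucket so it is not mistaken for a cluster.
    noise = counts.pop(noise_label, 0)
    return dict(counts), noise


# Assumed label array as a density-based clusterer might produce it.
sizes, noise = cluster_sizes([0, 1, 1, -1, 0, 1, -1])
```

For KMeans or agglomerative labels the same helper works unchanged and simply reports zero noise points.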

Supported Embedding Models

Backend	Model Example	Notes
OpenAI API	text-embedding-ada-002	Requires API key; high quality; remote API call
SentenceTransformer	all-MiniLM-L6-v2	Local execution; no API key needed; configurable models

Usage Examples

from fastchat.serve.monitor.topic_clustering import (
    read_texts,
    get_embeddings,
    run_k_means,
    get_topk_indices,
    print_topk,
    get_cluster_info,
)

# Step 1: Read and filter prompts
texts = read_texts(
    "conversations.jsonl",
    min_length=10,
    max_length=500,
    english_only=True,
)
print(f"Loaded {len(texts)} prompts after filtering")

# Step 2: Generate embeddings
embeddings = get_embeddings(
    texts,
    model_name="text-embedding-ada-002",
    batch_size=64,
)
print(f"Embedding shape: {embeddings.shape}")

# Step 3: Cluster with KMeans
num_clusters = 20
labels, centers = run_k_means(embeddings, num_clusters)

# Step 4: Find representative samples
topk_indices = get_topk_indices(centers, labels, embeddings, topk=5)

# Step 5: Display results
print_topk(texts, labels, topk_indices, show_cut_off=150)

# Step 6: Get structured cluster info for downstream use
cluster_info = get_cluster_info(texts, labels, topk_indices)
for cid, info in cluster_info.items():
    print(f"Cluster {cid}: {info['size']} prompts")

Related Pages

Implements: Principle:Lm_sys_FastChat_Prompt_Topic_Clustering
Lm_sys_FastChat_Summarize_Cluster - Generates human-readable topic labels from cluster output
Lm_sys_FastChat_Criteria_Labeling - Complementary labeling approach using predefined criteria
Lm_sys_FastChat_Deduplication - Deduplication that benefits from cluster-aware processing
Lm_sys_FastChat_Monitor_Markdown - Renders category-specific leaderboards informed by topic clusters
