Implementation:Lm sys FastChat Topic Clustering

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Model_Evaluation
Last Updated 2026-02-07 06:00 GMT

Overview

Clusters user prompts from FastChat conversations using text embeddings and multiple clustering algorithms, supporting both OpenAI and SentenceTransformer embedding models.

Description

Topic Clustering is a text clustering module that groups user prompts into topical clusters by embedding similarity. It implements an end-to-end pipeline: reading and filtering text data, generating embeddings, running a clustering algorithm, and extracting representative samples from each cluster. It supports three clustering algorithms (KMeans, AgglomerativeClustering, and HDBSCAN) and two embedding backends (OpenAI API embeddings and local SentenceTransformer models).

The pipeline begins with read_texts, which loads prompts from an input file and applies configurable filters for minimum and maximum text length as well as optional English-only filtering. The get_embeddings function then converts the filtered texts into dense vector representations using either the OpenAI embedding API or a locally loaded SentenceTransformer model, processing texts in configurable batch sizes for memory efficiency.
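The filtering and batching steps described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: `filter_by_length` and `iter_batches` are hypothetical helper names, and it assumes length is measured in characters with inclusive bounds.

```python
from typing import Iterator, List


def filter_by_length(
    texts: List[str], min_length: int = 0, max_length: int = 1000
) -> List[str]:
    """Keep stripped texts whose character count lies in [min_length, max_length].

    Hypothetical helper: inclusive bounds and character counts are assumptions.
    """
    kept = []
    for text in texts:
        text = text.strip()
        if min_length <= len(text) <= max_length:
            kept.append(text)
    return kept


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = ["hi", "  How do I reverse a list in Python?  ", "x" * 2000]
filtered = filter_by_length(prompts, min_length=10, max_length=500)

# Each batch would be sent to the embedding backend in a single call.
batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
```

Batching bounds the amount of text held in a single request or forward pass, which is why the last batch may be smaller than `batch_size`.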

The clustering stage offers three algorithms suited to different use cases. run_k_means provides fast, scalable clustering when the number of clusters is known in advance. run_agg_cluster uses hierarchical agglomerative clustering, which builds clusters bottom-up by repeatedly merging the closest groups until num_clusters remain. run_hdbscan_cluster performs density-based clustering that determines the number of clusters automatically and marks noise points with the label -1. After clustering, get_topk_indices finds the most representative examples in each cluster (those closest to the cluster center), and get_cluster_info aggregates cluster metadata, including size, representative samples, and centroid information.
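The idea of "most representative" samples, meaning those closest to a cluster center, can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the module's implementation: `topk_nearest_to_center` is a hypothetical name, Euclidean distance is assumed as the closeness measure, and cluster IDs are assumed to be the row indices of `centers`.

```python
import numpy as np


def topk_nearest_to_center(centers, labels, embeddings, topk):
    """Map each cluster ID to the indices of its topk members nearest the center.

    Hypothetical helper: assumes cluster i corresponds to centers[i] and
    uses Euclidean distance to rank members.
    """
    result = {}
    for cluster_id, center in enumerate(centers):
        # Positions (in the full dataset) of the samples assigned to this cluster.
        member_idx = np.flatnonzero(labels == cluster_id)
        if member_idx.size == 0:
            result[cluster_id] = []
            continue
        # Distance from every member of this cluster to its center.
        dists = np.linalg.norm(embeddings[member_idx] - center, axis=1)
        order = np.argsort(dists)[:topk]
        result[cluster_id] = member_idx[order].tolist()
    return result


embeddings = np.array(
    [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 5.2], [6.0, 6.0]]
)
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([embeddings[labels == k].mean(axis=0) for k in (0, 1)])

top2 = topk_nearest_to_center(centers, labels, embeddings, topk=2)
# top2 == {0: [1, 0], 1: [4, 3]}
```

The returned indices refer to positions in the original `embeddings` array, so they can be used directly to look up the corresponding prompt texts.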

Usage

Use this module to discover what topics users are asking about in the FastChat arena. It is typically run periodically on accumulated conversation data to identify trending topics, inform category design for leaderboards, and provide input to the summarize_cluster module for human-readable topic labels.

Code Reference

Source Location

Signature

def read_texts(
    input_file: str,
    min_length: int = 0,
    max_length: int = 1000,
    english_only: bool = False
) -> list:
    """Read and filter text prompts from an input file."""

def get_embeddings(
    texts: list,
    model_name: str = "text-embedding-ada-002",
    batch_size: int = 128
) -> np.ndarray:
    """Generate embeddings for a list of texts using OpenAI or SentenceTransformer models."""

def run_k_means(embeddings: np.ndarray, num_clusters: int) -> tuple:
    """Run KMeans clustering on embeddings, returning labels and cluster centers."""

def run_agg_cluster(embeddings: np.ndarray, num_clusters: int) -> np.ndarray:
    """Run Agglomerative Clustering on embeddings, returning cluster labels."""

def run_hdbscan_cluster(embeddings: np.ndarray) -> np.ndarray:
    """Run HDBSCAN density-based clustering on embeddings, returning cluster labels."""

def get_topk_indices(
    centers: np.ndarray,
    labels: np.ndarray,
    embeddings: np.ndarray,
    topk: int
) -> dict:
    """Find the top-k most representative sample indices for each cluster."""

def print_topk(
    texts: list,
    labels: np.ndarray,
    topk_indices: dict,
    show_cut_off: int = 200
) -> None:
    """Print the top-k representative samples for each cluster to stdout."""

def get_cluster_info(
    texts: list,
    labels: np.ndarray,
    topk_indices: dict
) -> dict:
    """Aggregate cluster metadata including sizes and representative texts."""

Import

from fastchat.serve.monitor.topic_clustering import run_k_means, get_embeddings

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| input_file | str | Yes | Path to a file containing text prompts (one per line or JSONL) |
| min_length | int | No | Minimum text length to include (default: 0) |
| max_length | int | No | Maximum text length to include (default: 1000) |
| english_only | bool | No | Whether to filter for English-only texts (default: False) |
| texts | list[str] | Yes | List of text strings to embed (used by get_embeddings) |
| model_name | str | No | Embedding model identifier (default: "text-embedding-ada-002") |
| batch_size | int | No | Number of texts per embedding API call (default: 128) |
| embeddings | np.ndarray | Yes | Embedding matrix of shape (n_samples, embedding_dim) (used by clustering functions) |
| num_clusters | int | Yes | Number of clusters for KMeans and Agglomerative (not used by HDBSCAN) |
| centers | np.ndarray | Yes | Cluster center vectors (used by get_topk_indices) |
| labels | np.ndarray | Yes | Cluster assignment labels per sample |
| topk | int | Yes | Number of representative samples per cluster (used by get_topk_indices) |

Outputs

| Name | Type | Description |
|---|---|---|
| texts | list[str] | read_texts returns filtered list of prompt strings |
| embeddings | np.ndarray | get_embeddings returns array of shape (n_texts, embedding_dim) |
| labels, centers | tuple[np.ndarray, np.ndarray] | run_k_means returns cluster labels and center vectors |
| labels | np.ndarray | run_agg_cluster and run_hdbscan_cluster return cluster label arrays |
| topk_indices | dict[int, list[int]] | get_topk_indices returns a mapping from cluster ID to list of sample indices |
| cluster_info | dict | get_cluster_info returns a dictionary with cluster sizes, representative texts, and metadata |
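To make the shapes of topk_indices and cluster_info concrete, here is one way such an aggregation could look. It is a sketch under assumptions rather than the module's code: `summarize_clusters` is a hypothetical name, and while the `"size"` key mirrors the usage example on this page, the `"samples"` key is an assumed name for the representative texts.

```python
from collections import Counter


def summarize_clusters(texts, labels, topk_indices):
    """Build {cluster_id: {"size": ..., "samples": [...]}} from cluster labels.

    Hypothetical helper; the "samples" key name is an assumption.
    """
    # Count how many samples were assigned to each cluster.
    sizes = Counter(int(label) for label in labels)
    info = {}
    for cluster_id, indices in topk_indices.items():
        info[cluster_id] = {
            "size": sizes.get(cluster_id, 0),
            # Look up the representative prompts by their dataset indices.
            "samples": [texts[i] for i in indices],
        }
    return info


texts = ["sort a list", "reverse a list", "write a poem", "write a haiku", "merge two dicts"]
labels = [0, 0, 1, 1, 0]
topk_indices = {0: [1, 0], 1: [2]}

cluster_info = summarize_clusters(texts, labels, topk_indices)
```

A plain dictionary keyed by cluster ID serializes directly to JSON, which is convenient when the result feeds a later labeling step.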

Supported Algorithms

| Algorithm | Function | Requires num_clusters | Notes |
|---|---|---|---|
| KMeans | run_k_means | Yes | Fast and scalable; returns both labels and cluster centers |
| Agglomerative | run_agg_cluster | Yes | Hierarchical (bottom-up merging); returns labels only; can capture non-spherical clusters, depending on the linkage used |
| HDBSCAN | run_hdbscan_cluster | No | Density-based; automatically determines cluster count; identifies noise points as label -1 |
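Per the signatures above, run_agg_cluster and run_hdbscan_cluster return only labels, while get_topk_indices expects center vectors. One possible way to bridge that gap is to take the mean embedding of each cluster and skip noise points. The sketch below assumes this approach; `centers_from_labels` is a hypothetical helper and not part of the module.

```python
import numpy as np


def centers_from_labels(embeddings, labels):
    """Return (cluster_ids, centers): the mean embedding of each non-noise cluster.

    Hypothetical helper: negative labels (such as -1) are treated as noise
    and excluded from every center.
    """
    cluster_ids = np.unique(labels[labels >= 0])
    if cluster_ids.size == 0:
        # Every point was labeled as noise, so there are no centers.
        return cluster_ids, np.empty((0, embeddings.shape[1]))
    centers = np.stack(
        [embeddings[labels == cid].mean(axis=0) for cid in cluster_ids]
    )
    return cluster_ids, centers


embeddings = np.array(
    [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0], [50.0, 50.0]]
)
labels = np.array([0, 0, 1, 1, -1])  # the last point is noise

cluster_ids, centers = centers_from_labels(embeddings, labels)
# centers[0] is [1.0, 0.0] and centers[1] is [10.0, 11.0]
```

Row i of `centers` belongs to `cluster_ids[i]`, which equals i only when the cluster labels are contiguous and start at 0.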

Supported Embedding Models

| Backend | Model Example | Notes |
|---|---|---|
| OpenAI API | text-embedding-ada-002 | Requires API key; high quality; remote API call |
| SentenceTransformer | all-MiniLM-L6-v2 | Local execution; no API key needed; configurable models |

Usage Examples

from fastchat.serve.monitor.topic_clustering import (
    read_texts,
    get_embeddings,
    run_k_means,
    get_topk_indices,
    print_topk,
    get_cluster_info,
)

# Step 1: Read and filter prompts
texts = read_texts(
    "conversations.jsonl",
    min_length=10,
    max_length=500,
    english_only=True,
)
print(f"Loaded {len(texts)} prompts after filtering")

# Step 2: Generate embeddings
embeddings = get_embeddings(
    texts,
    model_name="text-embedding-ada-002",
    batch_size=64,
)
print(f"Embedding shape: {embeddings.shape}")

# Step 3: Cluster with KMeans
num_clusters = 20
labels, centers = run_k_means(embeddings, num_clusters)

# Step 4: Find representative samples
topk_indices = get_topk_indices(centers, labels, embeddings, topk=5)

# Step 5: Display results
print_topk(texts, labels, topk_indices, show_cut_off=150)

# Step 6: Get structured cluster info for downstream use
cluster_info = get_cluster_info(texts, labels, topk_indices)
for cid, info in cluster_info.items():
    print(f"Cluster {cid}: {info['size']} conversations")
