Implementation:Lm sys FastChat Summarize Cluster

From Leeroopedia


Knowledge Sources
Domains: Data_Processing, Model_Evaluation
Last Updated: 2026-02-07 06:00 GMT

Overview

Uses an LLM to generate human-readable topic summaries for conversation clusters, producing labeled topic distributions as JSONL output.

Description

Summarize Cluster is a post-processing module that transforms numeric cluster assignments into meaningful topic labels using large language model summarization. After conversations have been grouped into clusters by the topic clustering module, this module samples representative prompts from each cluster and asks an LLM to generate a concise topic label that describes the common theme.

The module runs as a main script. It loads a pickle file containing clustered conversation data and iterates over the clusters. For each cluster it truncates the representative prompts to a manageable length with the truncate_string helper and builds a prompt asking the LLM to identify the overarching topic. The LLM's response supplies the short topic label; the percentage is the cluster's share of all conversations. Results are written to a JSONL file with one entry per cluster, containing the topic label and the cluster size percentage.
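The flow described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code. The cluster layout (a list of `(num_samples, prompts)` tuples), the prompt wording, the names `truncate`, `build_prompt`, `summarize_clusters` and `to_jsonl`, and the `ask_llm` callable are all assumptions made for this example. The sketch also assumes the percentage is computed from cluster sizes.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical cluster layout: (number of conversations, representative prompts).
Cluster = Tuple[int, List[str]]


def truncate(s: str, l: int) -> str:
    # Local stand-in for truncate_string, following this page's description.
    return s[:l] + "..." if len(s) > l else s


def build_prompt(prompts: List[str], max_len: int = 200) -> str:
    # Assumed prompt wording; the real instruction text may differ.
    body = "\n".join(truncate(p, max_len) for p in prompts)
    return "Summarize the common topic of these messages in a few words:\n" + body


def summarize_clusters(clusters: List[Cluster], ask_llm: Callable[[str], str]) -> List[dict]:
    total = sum(n for n, _ in clusters)
    records = []
    for num_samples, prompts in clusters:
        topic = ask_llm(build_prompt(prompts)).strip()
        # Assumption: the percentage is derived from cluster sizes.
        percentage = round(100 * num_samples / total, 2)
        records.append({"topic": topic, "percentage": percentage})
    return records


def to_jsonl(records: List[dict]) -> str:
    # One JSON object per line.
    return "\n".join(json.dumps(r) for r in records) + "\n"


if __name__ == "__main__":
    demo = [(3, ["How do I sort a list in Python?"]), (1, ["Write a poem about rain"])]

    def fake_llm(prompt: str) -> str:
        # Placeholder for a real LLM API call.
        return " Example topic \n"

    print(to_jsonl(summarize_clusters(demo, fake_llm)), end="")
```

Passing the LLM call in as a plain callable keeps the sketch testable without network access; a real run would substitute a function that calls the configured API endpoint.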

This module bridges the gap between unsupervised clustering (which produces numeric group IDs) and human-interpretable category labels. The generated topic summaries feed monitoring dashboards and research reports, and they inform the design of category-specific leaderboards.

Usage

Use this module after running topic clustering to convert numeric cluster IDs into meaningful topic labels. It requires an LLM API endpoint and a pickle file containing the clustered data from the topic clustering module.

Code Reference

Source Location

Signature

def truncate_string(s: str, l: int) -> str:
    """Truncate a string to a maximum length, appending '...' if truncated."""

# Main script:
# Loads cluster pkl, iterates clusters, queries LLM for topic labels,
# writes JSONL output with topic and percentage per cluster.
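One way to satisfy the docstring above is sketched here. It is an assumption about the behaviour (keep the first `l` characters, then append `'...'`), not a copy of the FastChat source, and the name `truncate_string_sketch` is hypothetical so that it is not mistaken for the library function.

```python
def truncate_string_sketch(s: str, l: int) -> str:
    """Return s unchanged if it fits in l characters, else its first l characters plus '...'."""
    if len(s) <= l:
        return s
    return s[:l] + "..."


if __name__ == "__main__":
    print(truncate_string_sketch("short", 10))
    print(truncate_string_sketch("a" * 12, 10))
```

Under this reading a truncated result is `l + 3` characters long. A variant that must stay within `l` characters overall would slice to `l - 3` before appending the ellipsis.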

Import

from fastchat.serve.monitor.summarize_cluster import truncate_string

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| cluster_pkl | str | Yes | Path to a pickle file containing clustered conversation data (produced by topic_clustering.py) |
| s | str | Yes | Input string to truncate (used by truncate_string) |
| l | int | Yes | Maximum allowed string length (used by truncate_string) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| output_jsonl | file | JSONL file where each line contains a JSON object with "topic" (str) and "percentage" (float) keys for each cluster |
| truncated | str | truncate_string returns the input string truncated to the specified length with "..." appended if needed |

Output Format

Each line in the output JSONL file follows this structure:

{"topic": "Programming help with Python", "percentage": 12.5}
{"topic": "Creative writing and storytelling", "percentage": 8.3}
{"topic": "Math and science homework", "percentage": 7.1}
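A downstream consumer could load this file with the standard library alone. The sketch below assumes each line carries at least the `topic` and `percentage` keys described in the I/O contract; the function names `load_topics` and `largest_topics` and the sample records are illustrative only.

```python
import io
import json
from typing import Iterable, List


def load_topics(lines: Iterable[str]) -> List[dict]:
    # Parse one JSON object per non-empty line.
    return [json.loads(line) for line in lines if line.strip()]


def largest_topics(records: List[dict], k: int = 2) -> List[str]:
    # Rank clusters by their share of conversations, largest first.
    ranked = sorted(records, key=lambda r: r["percentage"], reverse=True)
    return [r["topic"] for r in ranked[:k]]


if __name__ == "__main__":
    # An in-memory stand-in for an opened JSONL file.
    sample = io.StringIO(
        '{"topic": "Math and science homework", "percentage": 7.1}\n'
        '{"topic": "Programming help with Python", "percentage": 12.5}\n'
    )
    print(largest_topics(load_topics(sample)))
```

Because `load_topics` accepts any iterable of lines, the same call works with a file object from `open(path)`.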

Usage Examples

from fastchat.serve.monitor.summarize_cluster import truncate_string

# Truncate a long prompt for LLM summarization context
long_prompt = "This is a very long user prompt that goes on and on..." * 20
truncated = truncate_string(long_prompt, 200)
print(truncated)  # First 200 characters followed by "..."

# Typical command-line usage:
# python -m fastchat.serve.monitor.summarize_cluster \
#     --input clusters.pkl \
#     --output topic_summaries.jsonl \
#     --model gpt-4
