Implementation:ChenghaoMou Text dedup ClusterVisualizer
| Knowledge Sources | |
|---|---|
| Domains | Visualization, Reporting, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by the text-dedup report module, for interactively visualizing and exploring the cluster results of text deduplication.
Description
The ClusterVisualizer class and create_gradio_app factory function provide a multi-tab Gradio web application for inspecting deduplication output. The ClusterVisualizer loads a deduplicated HuggingFace Dataset from disk along with pickled cluster parent mappings, computes summary statistics (total records, clusters, deduplication rate), and renders interactive Plotly visualizations including histograms, box plots, cumulative distributions, treemaps, and log-log scatter plots. The Gradio UI offers tabs for dataset loading, cluster distribution analysis, detailed analysis, treemap visualization, top cluster inspection, individual cluster exploration with Jaccard similarity computation, full-text search across records, and side-by-side cluster comparison.
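The Jaccard similarity mentioned for cluster exploration can be illustrated with a minimal sketch. The source does not say how the module tokenizes text, so the lowercase whitespace tokenization below is an assumption, and `jaccard_similarity` is a hypothetical helper name rather than a function from report/gradio_app.py.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Return |A ∩ B| / |A ∪ B| for the lowercase word sets of two texts."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        # Convention chosen for this sketch: two empty texts count as identical.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two near-duplicate records share 3 of 5 distinct words.
score = jaccard_similarity("the quick brown fox", "the quick red fox")
print(f"{score:.2f}")  # 0.60
```

Scores close to 1.0 between samples of the same cluster suggest the cluster holds genuine near-duplicates; low scores may point to false positives worth inspecting.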
Usage
Import this module when you need to visually inspect the quality and distribution of deduplication results after running a MinHash, SimHash, Bloom filter, or suffix array deduplication pipeline. The Gradio app is the primary user-facing reporting interface for the text-dedup project.
Code Reference
Source Location
- Repository: ChenghaoMou_Text_dedup
- File: report/gradio_app.py
- Lines: 1-583
Signature
class ClusterVisualizer:
    def __init__(self) -> None:
        """Initialize with empty state. Dataset and clusters are loaded via load_dataset."""

    def load_dataset(
        self,
        output_dir: str,
        text_column: str = "text",
        cluster_column: str = "__CLUSTER__",
        index_column: str = "__INDEX__",
    ) -> tuple[pd.DataFrame | None, str, dict[str, Any]]:
        """Load dataset and compute cluster statistics."""

    def get_summary_stats(self) -> pd.DataFrame | None:
        """Get summary statistics about the dataset."""

    def plot_cluster_distribution(
        self, cluster_size_slider: tuple[int, int]
    ) -> go.Figure | None:
        """Plot interactive histogram of cluster sizes using Plotly."""

    def plot_detailed_distribution(self, bin_size: int = 10) -> go.Figure | None:
        """Create box plot, cumulative distribution, size range, and top-50 charts."""

    def plot_cluster_treemap(self, max_clusters: int = 100) -> go.Figure | None:
        """Create a treemap visualization of cluster sizes."""

    def get_top_clusters(self, n: int = 20) -> pd.DataFrame | None:
        """Get top N clusters by size."""

    def explore_cluster(
        self, cluster_id: int, max_samples: int = 10
    ) -> tuple[pd.DataFrame | None, str]:
        """Get samples from a specific cluster with optional Jaccard similarity stats."""

    def search_text(
        self, query: str, max_results: int = 20
    ) -> tuple[pd.DataFrame | None, str]:
        """Search for text across all records."""

    def compare_clusters(
        self, cluster_id_1: int, cluster_id_2: int, max_samples: int = 10
    ) -> tuple[pd.DataFrame | None, pd.DataFrame | None, str]:
        """Compare two clusters by showing samples from each."""


def create_gradio_app() -> gr.Blocks:
    """Factory function that builds and returns the complete Gradio Blocks application."""
Import
from report.gradio_app import ClusterVisualizer, create_gradio_app
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_dir | str | Yes | Path to the deduplicated dataset directory (contains HuggingFace Dataset + clusters.pickle) |
| text_column | str | No | Name of the text column in the dataset (default: "text") |
| cluster_column | str | No | Name of the cluster assignment column (default: "__CLUSTER__") |
| index_column | str | No | Name of the internal index column (default: "__INDEX__") |
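As a rough illustration of the clusters.pickle file expected inside output_dir, the sketch below writes a toy mapping to a temporary directory and reads it back with the standard pickle module. The `dict[int, int]` shape (record index to parent index) and the helper name `load_cluster_mapping` are assumptions made for this example; the actual loading logic in report/gradio_app.py may differ.

```python
import pickle
import tempfile
from pathlib import Path


def load_cluster_mapping(output_dir: str) -> dict[int, int]:
    """Read the pickled cluster parent mapping stored in output_dir."""
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    path = Path(output_dir) / "clusters.pickle"
    with path.open("rb") as handle:
        return pickle.load(handle)


# Toy round trip; the mapping shape {record index: parent index} is assumed.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_mapping = {0: 0, 1: 0, 2: 2}
    with (Path(tmp_dir) / "clusters.pickle").open("wb") as handle:
        pickle.dump(toy_mapping, handle)
    loaded = load_cluster_mapping(tmp_dir)

print(loaded)
```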
Outputs
| Name | Type | Description |
|---|---|---|
| Gradio app | gr.Blocks | Interactive web application served on localhost |
| Summary statistics | pd.DataFrame | Table of deduplication metrics (total records, clusters, dedup rate, etc.) |
| Distribution plots | go.Figure | Interactive Plotly charts (histograms, box plots, treemaps, scatter) |
| Cluster samples | pd.DataFrame | Text previews from individual clusters |
| Search results | pd.DataFrame | Records matching a text search query |
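The summary statistics listed above can be approximated from a list of per-record cluster IDs. This is a minimal sketch, not the module's implementation: `summarize_clusters` is a hypothetical helper, and it assumes that the deduplication rate means the fraction of records that would be dropped if exactly one record per cluster were kept.

```python
from collections import Counter


def summarize_clusters(cluster_ids: list[int]) -> dict[str, float]:
    """Compute basic deduplication metrics from per-record cluster IDs."""
    sizes = Counter(cluster_ids)  # cluster ID -> number of records in it
    total = len(cluster_ids)
    n_clusters = len(sizes)
    return {
        "total_records": total,
        "clusters": n_clusters,
        # Clusters holding more than one record, i.e. actual duplicate groups.
        "clusters_with_duplicates": sum(1 for s in sizes.values() if s > 1),
        "largest_cluster": max(sizes.values(), default=0),
        # Assumed definition: share of records removed when keeping one per cluster.
        "dedup_rate": (total - n_clusters) / total if total else 0.0,
    }


# Six records in three clusters: sizes 3, 1, and 2.
stats = summarize_clusters([0, 0, 0, 3, 4, 4])
print(stats)
```

The same `Counter` of cluster sizes is also the natural input for the histogram and cumulative-distribution plots described earlier.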
Usage Examples
Launch the Gradio App Directly
from report.gradio_app import create_gradio_app
# Create and launch the interactive web app
app = create_gradio_app()
app.launch(share=False, server_name="127.0.0.1", server_port=7860)
Use ClusterVisualizer Programmatically
from report.gradio_app import ClusterVisualizer
visualizer = ClusterVisualizer()
# Load a deduplicated dataset
stats, status, slider_update = visualizer.load_dataset(
    output_dir="./output",
    text_column="text",
    cluster_column="__CLUSTER__",
    index_column="__INDEX__",
)
print(status) # "Dataset loaded successfully!"
print(stats) # DataFrame with dedup metrics
# Get top clusters
top_df = visualizer.get_top_clusters(n=20)
print(top_df)
# Explore a specific cluster
samples, info = visualizer.explore_cluster(cluster_id=42, max_samples=10)
print(info)
print(samples)
# Search for text
results, search_status = visualizer.search_text("example query", max_results=20)
print(search_status)