Implementation:ChenghaoMou Text dedup ClusterVisualizer

From Leeroopedia
Knowledge Sources
Domains: Visualization, Reporting, Deduplication
Last Updated: 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the text-dedup report module, for interactively visualizing and exploring the clusters produced by text deduplication.

Description

The ClusterVisualizer class and create_gradio_app factory function provide a multi-tab Gradio web application for inspecting deduplication output. The ClusterVisualizer loads a deduplicated HuggingFace Dataset from disk along with pickled cluster parent mappings, computes summary statistics (total records, clusters, deduplication rate), and renders interactive Plotly visualizations including histograms, box plots, cumulative distributions, treemaps, and log-log scatter plots. The Gradio UI offers tabs for dataset loading, cluster distribution analysis, detailed analysis, treemap visualization, top cluster inspection, individual cluster exploration with Jaccard similarity computation, full-text search across records, and side-by-side cluster comparison.
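As a rough illustration of the summary statistics mentioned above, the following sketch derives record counts, cluster counts, and a deduplication rate from a parent mapping. It is not the module's code: the function name `summarize_clusters`, the shape of the mapping (record index to cluster parent id), and the definition of the deduplication rate (records removed if one record per cluster is kept, divided by total records) are all assumptions made for the example.

```python
from collections import Counter


def summarize_clusters(parents: dict[int, int]) -> dict[str, float]:
    """Hypothetical helper: summarize a record-index -> parent-id mapping."""
    # Cluster size = number of records that share the same parent id.
    sizes = Counter(parents.values())
    total = len(parents)
    n_clusters = len(sizes)
    # Assumption: one record per cluster is kept, the rest count as removed.
    removed = total - n_clusters
    return {
        "total_records": total,
        "clusters": n_clusters,
        "duplicate_clusters": sum(1 for s in sizes.values() if s > 1),
        "records_removed": removed,
        "dedup_rate": removed / total if total else 0.0,
    }


# Toy mapping: records 0-2 share parent 0, record 3 is alone, 4-5 share parent 4.
parents = {0: 0, 1: 0, 2: 0, 3: 3, 4: 4, 5: 4}
stats = summarize_clusters(parents)
print(stats)
```

In this toy mapping, six records collapse into three clusters, two of which contain duplicates, so half of the records would be removed under the stated assumption.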

Usage

Import this module when you need to visually inspect the quality and distribution of deduplication results after running a MinHash, SimHash, Bloom filter, or suffix array deduplication pipeline. The Gradio app is the primary user-facing reporting interface for the text-dedup project.
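One way to gauge cluster quality during such an inspection is the Jaccard similarity between cluster members, which the cluster exploration tab reports. The sketch below shows the general idea on whitespace-separated lowercase tokens. The tokenization, the function names, and the choice to average over all pairs are assumptions for illustration and may differ from what the module does.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two texts over lowercase whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # Convention chosen here: two empty texts are identical.
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_jaccard(texts: list[str]) -> float:
    """Average Jaccard similarity over every unordered pair of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 1.0  # A cluster with fewer than two members has no pairs.
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy cluster: two identical records and one near-duplicate.
cluster = ["the quick brown fox", "the quick red fox", "the quick brown fox"]
score = mean_pairwise_jaccard(cluster)
print(round(score, 3))
```

A mean close to 1.0 suggests the cluster holds near-identical texts, while a low mean may indicate false positives worth inspecting by hand.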

Code Reference

Source Location

Signature

class ClusterVisualizer:
    def __init__(self) -> None:
        """Initialize with empty state. Dataset and clusters are loaded via load_dataset."""

    def load_dataset(
        self,
        output_dir: str,
        text_column: str = "text",
        cluster_column: str = "__CLUSTER__",
        index_column: str = "__INDEX__",
    ) -> tuple[pd.DataFrame | None, str, dict[str, Any]]:
        """Load dataset and compute cluster statistics."""

    def get_summary_stats(self) -> pd.DataFrame | None:
        """Get summary statistics about the dataset."""

    def plot_cluster_distribution(
        self, cluster_size_slider: tuple[int, int]
    ) -> go.Figure | None:
        """Plot interactive histogram of cluster sizes using Plotly."""

    def plot_detailed_distribution(self, bin_size: int = 10) -> go.Figure | None:
        """Create box plot, cumulative distribution, size range, and top-50 charts."""

    def plot_cluster_treemap(self, max_clusters: int = 100) -> go.Figure | None:
        """Create a treemap visualization of cluster sizes."""

    def get_top_clusters(self, n: int = 20) -> pd.DataFrame | None:
        """Get top N clusters by size."""

    def explore_cluster(
        self, cluster_id: int, max_samples: int = 10
    ) -> tuple[pd.DataFrame | None, str]:
        """Get samples from a specific cluster with optional Jaccard similarity stats."""

    def search_text(
        self, query: str, max_results: int = 20
    ) -> tuple[pd.DataFrame | None, str]:
        """Search for text across all records."""

    def compare_clusters(
        self, cluster_id_1: int, cluster_id_2: int, max_samples: int = 10
    ) -> tuple[pd.DataFrame | None, pd.DataFrame | None, str]:
        """Compare two clusters by showing samples from each."""


def create_gradio_app() -> gr.Blocks:
    """Factory function that builds and returns the complete Gradio Blocks application."""

Import

from report.gradio_app import ClusterVisualizer, create_gradio_app

I/O Contract

Inputs

Name Type Required Description
output_dir str Yes Path to the deduplicated dataset directory (contains HuggingFace Dataset + clusters.pickle)
text_column str No Name of the text column in the dataset (default: "text")
cluster_column str No Name of the cluster assignment column (default: "__CLUSTER__")
index_column str No Name of the internal index column (default: "__INDEX__")

Outputs

Name Type Description
Gradio app gr.Blocks Interactive web application served on localhost
Summary statistics pd.DataFrame Table of deduplication metrics (total records, clusters, dedup rate, etc.)
Distribution plots go.Figure Interactive Plotly charts (histograms, box plots, treemaps, scatter)
Cluster samples pd.DataFrame Text previews from individual clusters
Search results pd.DataFrame Records matching a text search query

Usage Examples

Launch the Gradio App Directly

from report.gradio_app import create_gradio_app

# Create and launch the interactive web app
app = create_gradio_app()
app.launch(share=False, server_name="127.0.0.1", server_port=7860)

Use ClusterVisualizer Programmatically

from report.gradio_app import ClusterVisualizer

visualizer = ClusterVisualizer()

# Load a deduplicated dataset
stats, status, slider_update = visualizer.load_dataset(
    output_dir="./output",
    text_column="text",
    cluster_column="__CLUSTER__",
    index_column="__INDEX__",
)
print(status)  # "Dataset loaded successfully!"
print(stats)   # DataFrame with dedup metrics

# Get top clusters
top_df = visualizer.get_top_clusters(n=20)
print(top_df)

# Explore a specific cluster
samples, info = visualizer.explore_cluster(cluster_id=42, max_samples=10)
print(info)
print(samples)

# Search for text
results, search_status = visualizer.search_text("example query", max_results=20)
print(search_status)
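For readers who want a feel for what a capped full-text search over records involves, here is a minimal stand-alone sketch. It assumes a case-insensitive substring match and stops once `max_results` hits are found. The function `search_records` is hypothetical and is not part of the module, whose actual matching rules are not documented on this page.

```python
def search_records(
    records: list[str], query: str, max_results: int = 20
) -> list[tuple[int, str]]:
    """Hypothetical helper: return (index, text) pairs containing the query."""
    needle = query.strip().lower()
    if not needle:
        return []  # An empty query matches nothing in this sketch.
    hits: list[tuple[int, str]] = []
    for idx, text in enumerate(records):
        if needle in text.lower():
            hits.append((idx, text))
            if len(hits) >= max_results:
                break  # Stop early once the result cap is reached.
    return hits


records = ["Example query one", "unrelated", "another EXAMPLE QUERY"]
hits = search_records(records, "example query")
print(hits)
```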
