Implementation:ChenghaoMou Text dedup ClusterVisualizer

Knowledge Sources	ChenghaoMou_Text_dedup Gradio Docs Plotly Docs
Domains	Visualization, Reporting, Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the text-dedup report module, for interactively visualizing and exploring the clusters produced by text deduplication.

Description

The ClusterVisualizer class and create_gradio_app factory function provide a multi-tab Gradio web application for inspecting deduplication output. The ClusterVisualizer loads a deduplicated HuggingFace Dataset from disk along with pickled cluster parent mappings, computes summary statistics (total records, clusters, deduplication rate), and renders interactive Plotly visualizations including histograms, box plots, cumulative distributions, treemaps, and log-log scatter plots. The Gradio UI offers tabs for dataset loading, cluster distribution analysis, detailed analysis, treemap visualization, top cluster inspection, individual cluster exploration with Jaccard similarity computation, full-text search across records, and side-by-side cluster comparison.
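The summary statistics mentioned above can be illustrated in plain Python. This is a minimal sketch, not the module's code: it assumes the pickled mapping is a dict from record index to cluster parent index, and the helper name `summarize_clusters`, the treatment of unmapped records as singletons, and the definition of the deduplication rate (share of records removed when one record per cluster is kept) are all assumptions made for illustration.

```python
from collections import Counter


def summarize_clusters(parents: dict[int, int], total_records: int) -> dict[str, float]:
    """Summarize a hypothetical index -> parent mapping.

    Records absent from the mapping are treated as singletons (assumption).
    """
    # Count how many records point at each cluster parent.
    sizes = Counter(parents.values())
    # Only clusters with more than one member contain duplicates.
    dup_clusters = {cid: n for cid, n in sizes.items() if n > 1}
    # Keeping one record per duplicate cluster removes (n - 1) records each.
    removed = sum(n - 1 for n in dup_clusters.values())
    return {
        "total_records": total_records,
        "duplicate_clusters": len(dup_clusters),
        "largest_cluster": max(dup_clusters.values(), default=0),
        "records_removed": removed,
        "dedup_rate": removed / total_records if total_records else 0.0,
    }


# Toy mapping: records 0-2 share parent 0, records 3-4 share parent 3.
stats = summarize_clusters({0: 0, 1: 0, 2: 0, 3: 3, 4: 3, 5: 5}, total_records=10)
print(stats)
```

The same per-cluster size counts are the natural input for the histogram, cumulative distribution, and treemap views, which is why the sketch computes them first.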

Usage

Import this module when you need to visually inspect the quality and distribution of deduplication results after running a MinHash, SimHash, Bloom filter, or suffix array deduplication pipeline. The Gradio app is the primary user-facing reporting interface for the text-dedup project.

Code Reference

Source Location

Repository: ChenghaoMou_Text_dedup
File: report/gradio_app.py
Lines: 1-583

Signature

class ClusterVisualizer:
    def __init__(self) -> None:
        """Initialize with empty state. Dataset and clusters are loaded via load_dataset."""

    def load_dataset(
        self,
        output_dir: str,
        text_column: str = "text",
        cluster_column: str = "__CLUSTER__",
        index_column: str = "__INDEX__",
    ) -> tuple[pd.DataFrame | None, str, dict[str, Any]]:
        """Load dataset and compute cluster statistics."""

    def get_summary_stats(self) -> pd.DataFrame | None:
        """Get summary statistics about the dataset."""

    def plot_cluster_distribution(
        self, cluster_size_slider: tuple[int, int]
    ) -> go.Figure | None:
        """Plot interactive histogram of cluster sizes using Plotly."""

    def plot_detailed_distribution(self, bin_size: int = 10) -> go.Figure | None:
        """Create box plot, cumulative distribution, size range, and top-50 charts."""

    def plot_cluster_treemap(self, max_clusters: int = 100) -> go.Figure | None:
        """Create a treemap visualization of cluster sizes."""

    def get_top_clusters(self, n: int = 20) -> pd.DataFrame | None:
        """Get top N clusters by size."""

    def explore_cluster(
        self, cluster_id: int, max_samples: int = 10
    ) -> tuple[pd.DataFrame | None, str]:
        """Get samples from a specific cluster with optional Jaccard similarity stats."""

    def search_text(
        self, query: str, max_results: int = 20
    ) -> tuple[pd.DataFrame | None, str]:
        """Search for text across all records."""

    def compare_clusters(
        self, cluster_id_1: int, cluster_id_2: int, max_samples: int = 10
    ) -> tuple[pd.DataFrame | None, pd.DataFrame | None, str]:
        """Compare two clusters by showing samples from each."""


def create_gradio_app() -> gr.Blocks:
    """Factory function that builds and returns the complete Gradio Blocks application."""
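The `explore_cluster` docstring mentions Jaccard similarity statistics. As an illustration of what such a statistic measures, the sketch below computes pairwise Jaccard similarity over whitespace tokens for the texts in one cluster. The function names `jaccard` and `cluster_similarity` are hypothetical, and whitespace tokenization is an assumption; the module may tokenize differently (for example, with n-grams).

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cluster_similarity(texts: list[str]) -> dict[str, float]:
    """Min, mean, and max pairwise Jaccard similarity within one cluster."""
    scores = [jaccard(a, b) for a, b in combinations(texts, 2)]
    if not scores:
        # A cluster with fewer than two texts has no pairs to compare.
        return {"min": 1.0, "mean": 1.0, "max": 1.0}
    return {"min": min(scores), "mean": sum(scores) / len(scores), "max": max(scores)}


sample = ["the quick brown fox", "the quick brown dog", "the quick brown fox"]
summary = cluster_similarity(sample)
print(summary)
```

A low minimum similarity inside a cluster is a useful hint that the deduplication threshold may be grouping texts that are not true near-duplicates.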

Import

from report.gradio_app import ClusterVisualizer, create_gradio_app

I/O Contract

Inputs

Name	Type	Required	Description
output_dir	str	Yes	Path to the deduplicated dataset directory (contains HuggingFace Dataset + clusters.pickle)
text_column	str	No	Name of the text column in the dataset (default: "text")
cluster_column	str	No	Name of the cluster assignment column (default: "__CLUSTER__")
index_column	str	No	Name of the internal index column (default: "__INDEX__")

Outputs

Name	Type	Description
Gradio app	gr.Blocks	Interactive web application served on localhost
Summary statistics	pd.DataFrame	Table of deduplication metrics (total records, clusters, dedup rate, etc.)
Distribution plots	go.Figure	Interactive Plotly charts (histograms, box plots, treemaps, scatter)
Cluster samples	pd.DataFrame	Text previews from individual clusters
Search results	pd.DataFrame	Records matching a text search query

Usage Examples

Launch the Gradio App Directly

from report.gradio_app import create_gradio_app

# Create and launch the interactive web app
app = create_gradio_app()
app.launch(share=False, server_name="127.0.0.1", server_port=7860)

Use ClusterVisualizer Programmatically

from report.gradio_app import ClusterVisualizer

visualizer = ClusterVisualizer()

# Load a deduplicated dataset
stats, status, slider_update = visualizer.load_dataset(
    output_dir="./output",
    text_column="text",
    cluster_column="__CLUSTER__",
    index_column="__INDEX__",
)
print(status)  # "Dataset loaded successfully!"
print(stats)   # DataFrame with dedup metrics

# Get top clusters
top_df = visualizer.get_top_clusters(n=20)
print(top_df)

# Explore a specific cluster
samples, info = visualizer.explore_cluster(cluster_id=42, max_samples=10)
print(info)
print(samples)

# Search for text
results, search_status = visualizer.search_text("example query", max_results=20)
print(search_status)
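For readers without a deduplicated dataset at hand, the behaviour of a text search such as `search_text` can be approximated on plain dictionaries. This is a sketch under stated assumptions, not the module's implementation: `search_records` is a hypothetical helper, and case-insensitive substring matching and the wording of the status string are assumptions.

```python
def search_records(
    records: list[dict[str, object]],
    query: str,
    text_column: str = "text",
    max_results: int = 20,
) -> tuple[list[dict[str, object]], str]:
    """Return up to max_results records whose text contains the query."""
    needle = query.strip().lower()
    if not needle:
        return [], "Empty query."
    # Case-insensitive substring match against the text column (assumption).
    hits = [r for r in records if needle in str(r.get(text_column, "")).lower()]
    status = f"Found {len(hits)} match(es); showing {min(len(hits), max_results)}."
    return hits[:max_results], status


records = [
    {"text": "Hello world", "__CLUSTER__": 0},
    {"text": "hello again", "__CLUSTER__": 0},
    {"text": "unrelated", "__CLUSTER__": 2},
]
hits, status = search_records(records, "HELLO", max_results=1)
print(status)
```

Returning a status string alongside the results mirrors the `(results, status)` tuple shape shown in the signatures above, which lets a UI display a message even when nothing matches.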

Related Pages

Environment:ChenghaoMou_Text_dedup_Python_3_12_Environment
