
Implementation:ChenghaoMou Text dedup Save Dataset

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by text-dedup, for saving deduplicated datasets and cluster mappings to disk.

Description

The save_dataset function persists the final deduplicated Dataset to disk using HuggingFace's save_to_disk method. It optionally saves the cluster mapping as a pickle file, removes internal bookkeeping columns (such as __CLUSTER__) according to the output configuration flags, and creates the output directory structure.
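The save logic described above can be sketched in plain Python. This is a minimal, hedged sketch, not the library's actual code: a list of dicts stands in for the HuggingFace Dataset (the real function calls Dataset.save_to_disk), and the function name save_results and its parameters are hypothetical.

```python
import pickle
from pathlib import Path
from tempfile import mkdtemp

def save_results(output_dir, rows, clusters, save_clusters=True,
                 internal_columns=("__CLUSTER__",)):
    """Sketch of the save step: create the output directory, strip
    internal bookkeeping columns, persist the rows, and optionally
    pickle the cluster mapping alongside them."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output tree
    # Drop internal columns (the real code removes them from the Dataset)
    cleaned = [{k: v for k, v in row.items() if k not in internal_columns}
               for row in rows]
    # Stand-in for Dataset.save_to_disk(out)
    with open(out / "data.pickle", "wb") as f:
        pickle.dump(cleaned, f)
    if save_clusters:  # mirrors the save-clusters output flag
        with open(out / "clusters.pickle", "wb") as f:
            pickle.dump(clusters, f)
    return cleaned

rows = [{"text": "a", "__CLUSTER__": 0}, {"text": "b", "__CLUSTER__": 0}]
out = mkdtemp()
cleaned = save_results(out, rows, clusters={0: 0, 1: 0})
```

The internal columns are stripped before saving so that downstream consumers see only the original schema.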

Usage

Call this function as the final step of any deduplication pipeline to persist results. It is invoked by all four algorithm pipelines (MinHash, SimHash, Bloom Filter, Suffix Array).

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/data_sources/io.py
  • Lines: L66-97

Signature

def save_dataset(
    config: Config,
    *,
    final_data: Dataset,
    clusters: dict[int, int],
    **kwargs: Any,
) -> None:
    """Save the dataset to disk.

    Parameters
    ----------
    config : Config
        Configuration with output settings.
    final_data : Dataset
        The deduplicated dataset to save.
    clusters : dict[int, int]
        Cluster mapping (document index → cluster representative).
    """

Import

from text_dedup.data_sources.io import save_dataset

I/O Contract

Inputs

  • config (Config, required): Configuration with output directory and save flags
  • final_data (Dataset, required): The deduplicated dataset
  • clusters (dict[int, int], required): Cluster mapping (may be an empty dict)
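To illustrate the shape of the clusters argument (document index → cluster representative), here is a small made-up mapping; the indices and the idea of filtering to representatives are illustrative, not taken from the library.

```python
# Hypothetical cluster mapping: each document index maps to the index
# of its cluster's representative document.
clusters = {0: 0, 1: 0, 2: 2, 3: 2, 4: 4}

# Documents that map to themselves are the representatives; the others
# are duplicates of an earlier document in the same cluster.
keep = [i for i, rep in clusters.items() if i == rep]
```

A pipeline that passes an empty dict simply records no cluster structure.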

Outputs

  • Saved dataset (directory): HuggingFace Dataset saved to config.output.output_dir
  • clusters.pickle (file): Pickle of the cluster mapping, written only if config.output.save_clusters is True

Usage Examples

Saving After Deduplication

from text_dedup.data_sources.io import save_dataset

# After deduplication pipeline completes
save_dataset(
    config,
    final_data=deduplicated_ds,
    clusters=cluster_mapping,  # {doc_idx: cluster_representative}
)
# Output saved to config.output.output_dir
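The saved outputs can be read back later. The sketch below is hedged: the output directory and its setup are made up for the example, and reloading the Dataset itself (via HuggingFace's load_from_disk, the counterpart to save_to_disk) is shown only in a comment so the snippet stays dependency-free.

```python
import pickle
from pathlib import Path
from tempfile import mkdtemp

output_dir = Path(mkdtemp())  # stands in for config.output.output_dir

# Setup for the sketch: pretend a pipeline already wrote the mapping.
with open(output_dir / "clusters.pickle", "wb") as f:
    pickle.dump({0: 0, 1: 0}, f)

# The deduplicated Dataset itself would be reloaded with:
#   from datasets import load_from_disk
#   ds = load_from_disk(output_dir)

# The cluster mapping, if saved, is a plain pickle in the same directory:
with open(output_dir / "clusters.pickle", "rb") as f:
    clusters = pickle.load(f)
```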

Related Pages

Implements Principle

Requires Environment
