Implementation:ChenghaoMou Text dedup Save Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by text-dedup, for saving deduplicated datasets and cluster mappings to disk.
Description
The save_dataset function persists the final deduplicated Dataset to disk using HuggingFace's save_to_disk method. It also creates the output directory structure, optionally saves the cluster mapping as a pickle file, and removes internal columns (such as __CLUSTER__) according to the output configuration flags.
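The steps in this description can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the function name save_dataset_sketch and the parameters output_dir, save_clusters, and drop_columns are hypothetical stand-ins for the corresponding Config fields, and placing clusters.pickle inside the output directory is an assumption. The final_data argument is expected to behave like a HuggingFace Dataset (it needs column_names, remove_columns, and save_to_disk).

```python
import pickle
from pathlib import Path
from typing import Any


def save_dataset_sketch(
    final_data: Any,
    clusters: dict[int, int],
    *,
    output_dir: str,                                    # hypothetical stand-in for config.output.output_dir
    save_clusters: bool = False,                        # hypothetical stand-in for config.output.save_clusters
    drop_columns: tuple[str, ...] = ("__CLUSTER__",),   # hypothetical: internal columns to remove
) -> None:
    """Illustrative stand-in for save_dataset; not the text-dedup implementation."""
    out = Path(output_dir)
    # Create the output directory structure, including missing parents.
    out.mkdir(parents=True, exist_ok=True)

    # Optionally persist the cluster mapping next to the dataset (assumed location).
    if save_clusters:
        with open(out / "clusters.pickle", "wb") as f:
            pickle.dump(clusters, f)

    # Remove only the internal columns that are actually present.
    present = [c for c in drop_columns if c in final_data.column_names]
    if present:
        final_data = final_data.remove_columns(present)

    # Delegate the actual dataset serialization to the Dataset object.
    final_data.save_to_disk(str(out))
```

Checking column_names before calling remove_columns keeps the sketch safe for pipelines that never added the internal columns in the first place.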
Usage
Call this function as the final step of any deduplication pipeline to persist the results. All four algorithm pipelines (MinHash, SimHash, Bloom Filter, Suffix Array) call it.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/data_sources/io.py
- Lines: L66-97
Signature
def save_dataset(
    config: Config,
    *,
    final_data: Dataset,
    clusters: dict[int, int],
    **kwargs: Any,
) -> None:
    """Save the dataset to disk.

    Parameters
    ----------
    config : Config
        Configuration with output settings.
    final_data : Dataset
        The deduplicated dataset to save.
    clusters : dict[int, int]
        Cluster mapping (document index → cluster representative).
    """
Import
from text_dedup.data_sources.io import save_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Configuration with output directory, save flags |
| final_data | Dataset | Yes | The deduplicated dataset |
| clusters | dict[int, int] | Yes | Cluster mapping (can be empty dict) |
Outputs
| Name | Type | Description |
|---|---|---|
| Saved dataset | Directory | HuggingFace Dataset saved to config.output.output_dir |
| clusters.pickle | File | Optional pickle of cluster mapping (if config.output.save_clusters is True) |
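To inspect the saved outputs later, the dataset directory can be reloaded with datasets.load_from_disk, and the cluster mapping can be read back with the standard pickle module. The sketch below covers only the pickle side. It assumes that clusters.pickle sits directly inside the output directory; the helper names load_clusters and cluster_sizes are hypothetical and not part of text-dedup. Unpickle only files you trust, since pickle can execute arbitrary code on load.

```python
import pickle
from collections import Counter
from pathlib import Path


def load_clusters(output_dir: str) -> dict[int, int]:
    """Read the cluster mapping back from <output_dir>/clusters.pickle (assumed location)."""
    with open(Path(output_dir) / "clusters.pickle", "rb") as f:
        return pickle.load(f)


def cluster_sizes(clusters: dict[int, int]) -> Counter:
    """Count how many documents map to each cluster representative."""
    return Counter(clusters.values())
```

A representative with a count greater than one marks a group of documents that were treated as duplicates of each other.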
Usage Examples
Saving After Deduplication
from text_dedup.data_sources.io import save_dataset

# After the deduplication pipeline completes
save_dataset(
    config,
    final_data=deduplicated_ds,
    clusters=cluster_mapping,  # {doc_idx: cluster_representative}
)
# Output saved to config.output.output_dir
Related Pages
Implements Principle
Requires Environment