Implementation:ChenghaoMou Text dedup SA Run Command

Knowledge Sources	text-dedup deduplicate-text-datasets
Domains	Data_Structures, Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by text-dedup, for building suffix arrays and detecting self-similar regions through subprocess calls to Google Research's external deduplicate-text-datasets tools.

Description

The run_command method on SuffixArrayAlgorithmConfig executes a single external command via subprocess and serves as the interface to Google Research's deduplicate-text-datasets toolset. The suffix array pipeline calls it three times in sequence: (1) python scripts/make_suffix_array.py, which builds the suffix array from the concatenated text file; (2) cargo run self-similar, which finds self-similar regions in the suffix array that meet a length threshold; and (3) cargo run collect, which collects the duplicate byte ranges into an output file.

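Each command is executed through subprocess with shell=True, using google_repo_path as the working directory. Because the shell interprets the command string, redirection such as the > in the collect step works as written. A non-zero exit code raises RuntimeError.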
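This page shows only the signature and docstring of run_command, so the following is a minimal sketch of how a method with that contract could be written, not the library's actual body. It assumes subprocess.run with shell=True and cwd; the standalone function name run_shell_command is hypothetical.

```python
import subprocess


def run_shell_command(cmd: str, cwd: str) -> None:
    """Run `cmd` through the shell inside `cwd`; raise on a non-zero exit code.

    Hypothetical stand-in for run_command, written from its documented contract.
    """
    # shell=True hands the whole string to the shell, so pipes and
    # redirections (e.g. "> out.txt") behave as they would in a terminal.
    result = subprocess.run(cmd, shell=True, cwd=cwd)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command {cmd!r} failed with exit code {result.returncode}"
        )
```

Raising on failure matters here because the three pipeline steps depend on each other: if the suffix array build fails, the later cargo steps would otherwise run against missing or stale files.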

Usage

Used internally by the suffix array pipeline. Requires the Google Research deduplicate-text-datasets repository to be cloned at google_repo_path, with its Rust tools compiled.
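The requirements above could be checked before the pipeline starts. The sketch below is an illustration only: check_prerequisites is a hypothetical helper, not part of text-dedup, and it assumes that the presence of scripts/make_suffix_array.py and a cargo executable on PATH are reasonable proxies for a usable checkout. It does not verify that the Rust tools are actually compiled.

```python
import shutil
from pathlib import Path


def check_prerequisites(google_repo_path: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    repo = Path(google_repo_path)
    if not repo.is_dir():
        problems.append(f"repository not found at {repo}")
    elif not (repo / "scripts" / "make_suffix_array.py").is_file():
        problems.append(f"scripts/make_suffix_array.py missing under {repo}")
    # The self-similar and collect steps are invoked through `cargo run`.
    if shutil.which("cargo") is None:
        problems.append("cargo is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites("third_party/deduplicate-text-datasets"):
        print(f"prerequisite check failed: {problem}")
```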

Code Reference

Source Location

Repository: text-dedup
File: src/text_dedup/config/algorithms/suffix_array.py
Lines: L243-258

Signature

class SuffixArrayAlgorithmConfig(AlgorithmConfig):
    algo_name: Literal["suffix_array"] = "suffix_array"
    merge_strategy: Literal["longest", "overlapping"] = "longest"
    length_threshold: int = 100
    google_repo_path: str = "third_party/deduplicate-text-datasets"
    cache_dir: str = ".cache"

    def run_command(self, cmd: str, cwd: str) -> None:
        """Execute a shell command in the given working directory.

        Parameters
        ----------
        cmd : str
            Shell command to execute.
        cwd : str
            Working directory for the command.

        Raises
        ------
        RuntimeError
            If the command exits with non-zero code.
        """

Import

from text_dedup.config import SuffixArrayAlgorithmConfig

I/O Contract

Inputs

Name	Type	Required	Description
cmd	str	Yes	Shell command to execute (e.g., "python scripts/make_suffix_array.py ...")
cwd	str	Yes	Working directory (google_repo_path)
temp_text.txt	File	Yes	Concatenated document bytes file

Outputs

Name	Type	Description
Suffix array index	File	Binary suffix array index file (from make_suffix_array.py)
temp_output.txt	File	Duplicate byte ranges as (start, end) pairs (from cargo run collect)
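To illustrate how the (start, end) pairs in the output file might be consumed, here is a hedged sketch of a parser. The exact file layout is not specified on this page, so the code assumes one whitespace-separated pair of integers per line and skips every line that does not match; parse_byte_ranges is a hypothetical helper, not a text-dedup function.

```python
def parse_byte_ranges(lines):
    """Collect (start, end) byte ranges from an iterable of text lines.

    Assumption: each range is written as two non-negative integers separated
    by whitespace. Any other line (headers, logs, blanks) is ignored.
    """
    ranges = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            start, end = int(parts[0]), int(parts[1])
            if start < end:  # drop empty or inverted ranges
                ranges.append((start, end))
    return ranges


sample = ["out", "0 120", "300 450", "not a range"]
print(parse_byte_ranges(sample))  # [(0, 120), (300, 450)]
```

With a real file, the same function can be applied to an open file handle, for example parse_byte_ranges(open("output/temp_output.txt")).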

Usage Examples

Building Suffix Array and Finding Duplicates

from text_dedup.config.base import load_config_from_toml
from text_dedup.suffix_array import main
from pathlib import Path

# Full pipeline execution
config = load_config_from_toml(Path("configs/suffix_array.toml"))
main(config)

# Internally runs:
# algo.run_command("python scripts/make_suffix_array.py output/temp_text.txt", google_repo_path)
# algo.run_command("cargo run self-similar --data-file ... --length-threshold 100 ...", google_repo_path)
# algo.run_command("cargo run collect --data-file ... > output/temp_output.txt", google_repo_path)
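Since these commands run with shell=True, a path containing spaces or shell metacharacters would be split or interpreted by the shell. The sketch below shows one way to guard against that with shlex.quote. It covers only the first command, because the flags of the other two are elided on this page; build_make_suffix_array_cmd is a hypothetical helper and not part of text-dedup.

```python
import shlex


def build_make_suffix_array_cmd(text_file: str) -> str:
    """Compose the suffix array build command with the file path shell-quoted."""
    # shlex.quote leaves simple paths unchanged and wraps unsafe ones in quotes.
    return f"python scripts/make_suffix_array.py {shlex.quote(text_file)}"


print(build_make_suffix_array_cmd("output/temp_text.txt"))
# python scripts/make_suffix_array.py output/temp_text.txt
print(build_make_suffix_array_cmd("my data/temp text.txt"))
# python scripts/make_suffix_array.py 'my data/temp text.txt'
```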

Related Pages

Implements Principle

Principle:ChenghaoMou_Text_dedup_Suffix_Array_Construction

Requires Environment
