Implementation:ChenghaoMou Text dedup SA Run Command
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by text-dedup, for building suffix arrays and detecting self-similar regions through subprocess calls to Google Research's external deduplicate-text-datasets tools.
Description
The run_command method on SuffixArrayAlgorithmConfig executes a single external command via subprocess. It is how text-dedup interfaces with Google Research's deduplicate-text-datasets toolset. The suffix array pipeline calls it three times in sequence: (1) python scripts/make_suffix_array.py to build the suffix array from the concatenated text file, (2) cargo run self-similar to find self-similar regions using the suffix array with a length threshold, and (3) cargo run collect to collect duplicate byte ranges into an output file.
The method runs each command with shell=True, using the cwd argument (google_repo_path in the pipeline) as the working directory, and raises RuntimeError if the command exits with a non-zero code.
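The behaviour described above can be sketched as follows. This is a minimal illustration based only on the documented contract (shell execution, a working directory, RuntimeError on a non-zero exit), not the library's actual implementation; the function name run_shell_command and the error message format are assumptions.

```python
import subprocess


def run_shell_command(cmd: str, cwd: str) -> None:
    """Run `cmd` through the shell inside `cwd` and raise on failure.

    Hypothetical stand-in for run_command, written from its documented contract.
    """
    # shell=True hands the whole string to the shell, so redirections such as
    # "> output/temp_output.txt" behave as they would on the command line.
    result = subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the exit code and stderr so the failing step is easy to identify.
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}: {cmd}\n{result.stderr}"
        )
```

Because the command string is interpreted by a shell, it should only ever be built from trusted configuration values.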
Usage
Used internally by the suffix array pipeline. Requires the Google Research deduplicate-text-datasets repository to be cloned at the google_repo_path location with Rust tools compiled.
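A pre-flight check along these lines can catch a missing checkout before the pipeline starts. This is a sketch under assumptions: check_google_repo is a hypothetical helper, not part of text-dedup, and it only confirms that the repository directory, the scripts/make_suffix_array.py script, and a cargo executable are present. It does not verify that the Rust tools have actually been compiled.

```python
import shutil
from pathlib import Path


def check_google_repo(google_repo_path: str) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed.

    Hypothetical helper: the checks below are assumptions about what a usable
    checkout looks like, derived from the commands the pipeline runs.
    """
    problems = []
    repo = Path(google_repo_path)
    if not repo.is_dir():
        problems.append(f"repository directory not found: {repo}")
    elif not (repo / "scripts" / "make_suffix_array.py").is_file():
        problems.append("scripts/make_suffix_array.py is missing from the repository")
    # The self-similar and collect steps are invoked through `cargo run`.
    if shutil.which("cargo") is None:
        problems.append("cargo is not on PATH, so the Rust tools cannot be run")
    return problems
```

Calling check_google_repo("third_party/deduplicate-text-datasets") before running the pipeline gives a readable list of what is missing instead of a failed subprocess call.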
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/config/algorithms/suffix_array.py
- Lines: L243-258
Signature
class SuffixArrayAlgorithmConfig(AlgorithmConfig):
    algo_name: Literal["suffix_array"] = "suffix_array"
    merge_strategy: Literal["longest", "overlapping"] = "longest"
    length_threshold: int = 100
    google_repo_path: str = "third_party/deduplicate-text-datasets"
    cache_dir: str = ".cache"

    def run_command(self, cmd: str, cwd: str) -> None:
        """Execute a shell command in the given working directory.

        Parameters
        ----------
        cmd : str
            Shell command to execute.
        cwd : str
            Working directory for the command.

        Raises
        ------
        RuntimeError
            If the command exits with non-zero code.
        """
Import
from text_dedup.config import SuffixArrayAlgorithmConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cmd | str | Yes | Shell command to execute (e.g., "python scripts/make_suffix_array.py ...") |
| cwd | str | Yes | Working directory (google_repo_path) |
| temp_text.txt | File | Yes | File containing the concatenated bytes of all documents (read by make_suffix_array.py) |
Outputs
| Name | Type | Description |
|---|---|---|
| Suffix array index | File | Binary suffix array index file (from make_suffix_array.py) |
| temp_output.txt | File | Duplicate byte ranges as (start, end) pairs (from cargo run collect) |
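Reading the duplicate ranges back might look like the sketch below. The exact layout of temp_output.txt is not specified here, so this assumes each range appears on its own line as two whitespace-separated non-negative integers and that any other line (headers, blank lines) can be skipped. parse_byte_ranges is a hypothetical helper, not a text-dedup function.

```python
from pathlib import Path


def parse_byte_ranges(path: str) -> list[tuple[int, int]]:
    """Collect (start, end) byte ranges from an output file.

    Assumption: data lines look like "<start> <end>"; everything else is ignored.
    """
    ranges = []
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        # Keep only lines made of exactly two digit-only fields.
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            ranges.append((int(parts[0]), int(parts[1])))
    return ranges
```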
Usage Examples
Building Suffix Array and Finding Duplicates
from text_dedup.config.base import load_config_from_toml
from text_dedup.suffix_array import main
from pathlib import Path
# Full pipeline execution
config = load_config_from_toml(Path("configs/suffix_array.toml"))
main(config)
# Internally runs:
# algo.run_command("python scripts/make_suffix_array.py output/temp_text.txt", google_repo_path)
# algo.run_command("cargo run self-similar --data-file ... --length-threshold 100 ...", google_repo_path)
# algo.run_command("cargo run collect --data-file ... > output/temp_output.txt", google_repo_path)
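Since these commands are interpreted by a shell, a path containing spaces or shell metacharacters would need quoting before being placed in the command string. A minimal sketch using the standard library's shlex.quote follows; build_make_suffix_array_cmd is a hypothetical helper for illustration, and text-dedup may compose its commands differently.

```python
import shlex


def build_make_suffix_array_cmd(text_file: str) -> str:
    """Compose the suffix-array build command with the path shell-quoted."""
    # shlex.quote leaves simple paths untouched and wraps anything containing
    # spaces or shell metacharacters in single quotes.
    return f"python scripts/make_suffix_array.py {shlex.quote(text_file)}"
```

For a plain path such as output/temp_text.txt the resulting command is identical to the first one shown above; quoting only changes the string when the path would otherwise be split or reinterpreted by the shell.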