Environment:ChenghaoMou Text dedup Suffix Array External Tools
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Text_Deduplication, External_Tools |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
External toolchain environment requiring Python, Rust/Cargo, and Google Research's `deduplicate-text-datasets` repository for suffix array deduplication.
Description
The Suffix Array deduplication workflow relies on an external tool: Google Research's `deduplicate-text-datasets`. This tool is written in Rust and is compiled locally with Cargo. The workflow invokes external commands via `subprocess.Popen` with `shell=True`, so both `python` (for the `scripts/make_suffix_array.py` script) and `cargo` (for the `self-similar` and `collect` subcommands, launched with `cargo run`) must be available on the system PATH.
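
Since the commands are resolved by the shell at run time, a missing executable only surfaces once the workflow is already running. The following is a minimal pre-flight sketch, not part of the project: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, and the check only confirms that the executables can be found on PATH, not that they work.

```python
import shutil

# Executables the workflow's shell commands rely on (assumption: the plain
# names `python` and `cargo`, as they appear in the commands the workflow runs).
REQUIRED_TOOLS = ("python", "cargo")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("python and cargo are both on PATH")
```

Running this before starting a long deduplication job gives a readable message instead of a failed subprocess partway through.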
Usage
Use this environment only for the Suffix Array deduplication workflow. The other three algorithms (MinHash, SimHash, Bloom Filter) do not require these external tools. This environment is needed in addition to the base Python 3.12 environment.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux or macOS | Rust compilation required; Windows may need WSL |
| Hardware | CPU with multiple cores | The `num_proc` setting is passed to the `self-similar` command as `--num-threads` |
| Disk | SSD recommended | Suffix array construction is I/O intensive; temp files created in `output/` and `.cache/` directories |
| Software | Rust toolchain (rustc + cargo) | Required for compiling `deduplicate-text-datasets` |
| Software | Python 3.12+ | Required for `make_suffix_array.py` script |
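
The version and core-count requirements in the table can be checked from Python itself. This is a small sketch under stated assumptions: `MIN_PYTHON`, `python_is_supported`, and `default_thread_count` are hypothetical helpers, and the check inspects the interpreter running the snippet, which is not necessarily the `python` that the shell will resolve on PATH.

```python
import os
import sys

MIN_PYTHON = (3, 12)  # minimum version from the requirements table


def python_is_supported(version=None):
    """True when `version` (default: the running interpreter) is at least 3.12."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= MIN_PYTHON


def default_thread_count():
    """Number of logical CPUs, falling back to 1 when it cannot be determined."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Suggested thread count:", default_thread_count())
```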
Dependencies
System Packages
- Rust toolchain (`rustc` + `cargo`) — install via `rustup`
- Python >= 3.12 (for suffix array builder script)
- Git (to clone the Google Research repository)
External Repository
- `deduplicate-text-datasets` from Google Research
- Default path: `third_party/deduplicate-text-datasets` (configurable via `google_repo_path`)
- Must be cloned and built before running suffix array deduplication
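
A quick way to confirm that the checkout is where the workflow expects it is to look for the files it depends on. The sketch below is illustrative only: `repo_problems` is a hypothetical helper, the `Cargo.toml` check rests on the general fact that Cargo needs a manifest at the project root, and the snippet does not verify that the crate has actually been built.

```python
from pathlib import Path

DEFAULT_REPO_PATH = Path("third_party/deduplicate-text-datasets")


def repo_problems(repo_path=DEFAULT_REPO_PATH):
    """Return a list of problems; an empty list means the checkout looks usable."""
    repo = Path(repo_path)
    if not repo.is_dir():
        return [f"{repo} does not exist; clone the repository first"]
    problems = []
    # Cargo needs a manifest at the project root to build or run anything.
    if not (repo / "Cargo.toml").is_file():
        problems.append(f"{repo / 'Cargo.toml'} is missing")
    # The workflow runs this script from the repository root.
    script = repo / "scripts" / "make_suffix_array.py"
    if not script.is_file():
        problems.append(f"{script} is missing")
    return problems


if __name__ == "__main__":
    for problem in repo_problems():
        print(problem)
```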
Credentials
No credentials required. All operations are local.
Quick Install
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Make cargo available in the current shell (or open a new terminal)
. "$HOME/.cargo/env"
# Clone Google Research's deduplicate-text-datasets
mkdir -p third_party
git clone https://github.com/google-research/deduplicate-text-datasets.git third_party/deduplicate-text-datasets
# Build the Rust binaries
cd third_party/deduplicate-text-datasets
cargo build --release
cd ../..
Code Evidence
Default path configuration from `src/text_dedup/config/algorithms/suffix_array.py:17`:
google_repo_path: str = "third_party/deduplicate-text-datasets"
Subprocess execution with shell from `src/text_dedup/config/algorithms/suffix_array.py:243-258`:
def run_command(self, cmd: str, cwd: str) -> None:
    p = subprocess.Popen(
        cmd,
        shell=True,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        error_msg = f"Command {cmd} failed with code {p.returncode}. CWD: {cwd}"
        if stdout:
            error_msg += f"\nstdout:\n{stdout.decode(errors='replace')}"
        if stderr:
            error_msg += f"\nstderr:\n{stderr.decode(errors='replace')}"
        raise RuntimeError(error_msg)
External commands invoked from `src/text_dedup/suffix_array.py:59-75`:
with timer("Making suffix array", enable_spin=True):
    algo.run_command(
        f"python scripts/make_suffix_array.py {temp_text.relative_to(algo.google_repo_path)}",
        algo.google_repo_path,
    )
with timer("SelfSimilar", enable_spin=True):
    algo.run_command(
        f"cargo run self-similar --data-file {temp_text.relative_to(algo.google_repo_path)}"
        f" --length-threshold {algo.length_threshold} --cache-dir {cache_dir.relative_to(algo.google_repo_path)} --num-threads {algo.num_proc}",
        algo.google_repo_path,
    )
    algo.run_command(
        f"cargo run collect --data-file {temp_text.relative_to(algo.google_repo_path)}"
        f" --length-threshold {algo.length_threshold} --cache-dir {cache_dir.relative_to(algo.google_repo_path)} >"
        f" {temp_output.relative_to(algo.google_repo_path)}",
        algo.google_repo_path,
    )
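
The commands above are assembled as plain strings and handed to a shell, with every path made relative to the repository root because that root is the working directory. One consequence is that a path containing spaces or shell metacharacters would be split or interpreted by the shell. The sketch below shows one way such a command could be built with quoting; it is not the project's implementation. `build_self_similar_cmd` and the file names in the usage example are hypothetical, while the flags are the ones that appear in the excerpt.

```python
import shlex
from pathlib import Path


def build_self_similar_cmd(repo, data_file, cache_dir, length_threshold, num_threads):
    """Build the `cargo run self-similar` command line, relative to `repo`.

    Raises ValueError if `data_file` or `cache_dir` is not located under `repo`.
    """
    repo = Path(repo)
    rel_data = Path(data_file).relative_to(repo).as_posix()
    rel_cache = Path(cache_dir).relative_to(repo).as_posix()
    return (
        f"cargo run self-similar --data-file {shlex.quote(rel_data)}"
        f" --length-threshold {int(length_threshold)}"
        f" --cache-dir {shlex.quote(rel_cache)}"
        f" --num-threads {int(num_threads)}"
    )


if __name__ == "__main__":
    repo = Path("third_party/deduplicate-text-datasets")
    # Hypothetical file names, used only for illustration.
    print(build_self_similar_cmd(repo, repo / "output" / "temp_text.txt", repo / ".cache", 100, 4))
```

`Path.relative_to` raises `ValueError` when a file lies outside the repository, which makes a misplaced temporary file fail early rather than producing a command that points at the wrong location.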
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
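| `RuntimeError: Command ... failed with code 127` | The shell could not find `cargo` or `python` on PATH (exit code 127 means "command not found") | Install Rust via `rustup` and make sure an executable named `python`, not only `python3`, is on PATH |
| `RuntimeError: Command cargo run self-similar ... failed` | The Rust crate failed to compile, or the tool exited with an error. `cargo run` builds the debug profile on demand, so a missing release binary is not by itself the cause | Read the `stderr` section of the error message, then run `cargo build` in the `deduplicate-text-datasets` directory to reproduce any compile error |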
| `FileNotFoundError: third_party/deduplicate-text-datasets` | Google repo not cloned | Clone the repository to the expected path |
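
The rows above correspond to two different failure paths: a missing working directory raises `FileNotFoundError` before any command starts, whereas a missing executable is reported by the shell through exit code 127. The sketch below reproduces both under the assumption of a POSIX shell; `run_shell` and `classify_failure` are hypothetical helpers written for illustration and are not part of the project.

```python
import subprocess


def run_shell(cmd, cwd):
    """Run `cmd` through the shell in `cwd`; return (exit code, decoded stderr)."""
    proc = subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True)
    return proc.returncode, proc.stderr.decode(errors="replace")


def classify_failure(cmd, cwd):
    """Map the outcome of a shell invocation onto the rows of the table above."""
    try:
        code, _ = run_shell(cmd, cwd)
    except FileNotFoundError:
        # Raised by subprocess itself when `cwd` does not exist.
        return "working directory missing"
    if code == 127:
        # POSIX shells use 127 for "command not found".
        return "command not found"
    return "ok" if code == 0 else f"failed with code {code}"
```

Any other non-zero code comes from the invoked program itself, in which case the captured `stderr` is the place to look.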
Compatibility Notes
- Linux/macOS only: The `shell=True` subprocess calls assume a POSIX shell. Windows users should use WSL.
- Disk I/O intensive: Suffix array construction writes temporary files to `output/` and `.cache/` directories within the Google repo path. Use SSD storage for better performance.
- Temp file cleanup: Set `clean_cache=True` in output config to auto-clean temporary directories after processing.
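
If a run is interrupted, or `clean_cache` is left off, the temporary directories can be removed by hand. The following is a minimal sketch of such a manual cleanup, assuming the directory names given above; `remove_temp_dirs` is a hypothetical helper and the project's own cleanup logic may differ.

```python
import shutil
from pathlib import Path


def remove_temp_dirs(repo_path, names=("output", ".cache")):
    """Delete the named directories under `repo_path`; return the ones removed."""
    removed = []
    for name in names:
        target = Path(repo_path) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed


if __name__ == "__main__":
    for path in remove_temp_dirs("third_party/deduplicate-text-datasets"):
        print("removed", path)
```

Directories that do not exist are skipped, so the helper is safe to call repeatedly.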