
Environment:ChenghaoMou Text dedup Suffix Array External Tools

From Leeroopedia
Domains Infrastructure, Text_Deduplication, External_Tools
Last Updated 2026-02-14 21:00 GMT

Overview

External toolchain environment requiring Python, Rust/Cargo, and Google Research's `deduplicate-text-datasets` repository for suffix array deduplication.

Description

The Suffix Array deduplication workflow relies on an external tool: Google Research's `deduplicate-text-datasets`. This tool is written in Rust and must be compiled locally using Cargo. The workflow invokes external commands via `subprocess.Popen` with `shell=True`, requiring both Python (for the `make_suffix_array.py` script) and Cargo (for the `self-similar` and `collect` Rust binaries) to be available on the system PATH.
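Because every step shells out to `python` and `cargo`, missing tools only surface mid-run as subprocess failures. A minimal preflight sketch, using the standard library's `shutil.which`; the helper name is illustrative and not part of the project:

```python
import shutil


def preflight_check(tools=("python", "cargo")) -> list[str]:
    """Return the names of required tools that are missing from PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = preflight_check()
# A non-empty list here means the suffix array workflow will fail at runtime,
# e.g. ["cargo"] if the Rust toolchain is not installed.
```

Running this before kicking off deduplication turns a cryptic "failed with code 127" into an actionable message.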

Usage

Use this environment only for the Suffix Array deduplication workflow. The other three algorithms (MinHash, SimHash, Bloom Filter) do not require these external tools. This environment is needed in addition to the base Python 3.12 environment.

System Requirements

Category | Requirement | Notes
OS | Linux or macOS | Rust compilation required; Windows may need WSL
Hardware | Multi-core CPU | `--num-threads` is passed to the Rust binary
Disk | SSD recommended | Suffix array construction is I/O intensive; temp files are created in the `output/` and `.cache/` directories
Software | Rust toolchain (rustc + cargo) | Required for compiling `deduplicate-text-datasets`
Software | Python 3.12+ | Required for the `make_suffix_array.py` script

Dependencies

System Packages

  • Rust toolchain (`rustc` + `cargo`) — install via `rustup`
  • Python >= 3.12 (for suffix array builder script)
  • Git (to clone the Google Research repository)

External Repository

  • `deduplicate-text-datasets` from Google Research
  • Default path: `third_party/deduplicate-text-datasets` (configurable via `google_repo_path`)
  • Must be cloned and built before running suffix array deduplication
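A sketch of verifying the repository is cloned and built before starting a run. The `target/release` check assumes the binaries were built with `cargo build --release`; both the check and the helper name are illustrative, not part of the project:

```python
from pathlib import Path


def check_google_repo(repo_path: str = "third_party/deduplicate-text-datasets") -> None:
    """Raise with a helpful message if the Google repo is missing or unbuilt."""
    repo = Path(repo_path)
    # A missing Cargo.toml means the repository was never cloned (or the
    # path points somewhere else entirely).
    if not (repo / "Cargo.toml").is_file():
        raise FileNotFoundError(
            f"{repo_path} not found or not a Cargo project; clone the repository first"
        )
    # No release build directory means `cargo build --release` has not run yet.
    if not (repo / "target" / "release").is_dir():
        raise RuntimeError(
            f"no release build under {repo_path}; run `cargo build --release` there"
        )
```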

Credentials

No credentials required. All operations are local.

Quick Install

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"  # make cargo available in the current shell

# Clone Google Research's deduplicate-text-datasets
mkdir -p third_party
git clone https://github.com/google-research/deduplicate-text-datasets.git third_party/deduplicate-text-datasets

# Build the Rust binaries
cd third_party/deduplicate-text-datasets
cargo build --release
cd ../..

Code Evidence

Default path configuration from `src/text_dedup/config/algorithms/suffix_array.py:17`:

google_repo_path: str = "third_party/deduplicate-text-datasets"

Subprocess execution with shell from `src/text_dedup/config/algorithms/suffix_array.py:243-258`:

def run_command(self, cmd: str, cwd: str) -> None:
    p = subprocess.Popen(
        cmd,
        shell=True,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        error_msg = f"Command {cmd} failed with code {p.returncode}. CWD: {cwd}"
        if stdout:
            error_msg += f"\nstdout:\n{stdout.decode(errors='replace')}"
        if stderr:
            error_msg += f"\nstderr:\n{stderr.decode(errors='replace')}"
        raise RuntimeError(error_msg)

External commands invoked from `src/text_dedup/suffix_array.py:59-75`:

with timer("Making suffix array", enable_spin=True):
    algo.run_command(
        f"python scripts/make_suffix_array.py {temp_text.relative_to(algo.google_repo_path)}",
        algo.google_repo_path,
    )

with timer("SelfSimilar", enable_spin=True):
    algo.run_command(
        f"cargo run self-similar --data-file {temp_text.relative_to(algo.google_repo_path)}"
        f" --length-threshold {algo.length_threshold} --cache-dir {cache_dir.relative_to(algo.google_repo_path)} --num-threads {algo.num_proc}",
        algo.google_repo_path,
    )
    algo.run_command(
        f"cargo run collect --data-file {temp_text.relative_to(algo.google_repo_path)}"
        f" --length-threshold {algo.length_threshold} --cache-dir {cache_dir.relative_to(algo.google_repo_path)} >"
        f" {temp_output.relative_to(algo.google_repo_path)}",
        algo.google_repo_path,
    )

Common Errors

Error Message | Cause | Solution
`RuntimeError: Command ... failed with code 127` | Cargo or Python not found on PATH | Install Rust via `rustup` and ensure Python is on PATH
`RuntimeError: Command cargo run self-similar ... failed` | Rust binary not compiled | Run `cargo build --release` in the `deduplicate-text-datasets` directory
`FileNotFoundError: third_party/deduplicate-text-datasets` | Google repo not cloned | Clone the repository to the expected path

Compatibility Notes

  • Linux/macOS only: The `shell=True` subprocess calls assume a POSIX shell. Windows users should use WSL.
  • Disk I/O intensive: Suffix array construction writes temporary files to `output/` and `.cache/` directories within the Google repo path. Use SSD storage for better performance.
  • Temp file cleanup: Set `clean_cache=True` in output config to auto-clean temporary directories after processing.
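The cleanup that `clean_cache=True` performs can be sketched as removing the temporary directories after a run. The directory names follow the note above; the helper itself is illustrative, not the project's actual implementation:

```python
import shutil
from pathlib import Path


def clean_temp_dirs(repo_path: str, dirs=("output", ".cache")) -> None:
    """Delete the temporary directories created inside the Google repo path."""
    for name in dirs:
        target = Path(repo_path) / name
        # Tolerate a directory that was never created or already removed.
        if target.is_dir():
            shutil.rmtree(target)
```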
