
Environment:Google research Deduplicate text datasets Rust Cargo Build Environment

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, Text_Deduplication
Last Updated: 2026-02-14 21:00 GMT

Overview

A Linux environment with the Rust toolchain (Cargo) and a C compiler, used to build the `dedup_dataset` binary that all deduplication commands depend on.

Description

The core deduplication engine is written in Rust and must be compiled from source with `cargo build`. The resulting binary, `./target/debug/dedup_dataset`, provides the CLI subcommands (`make`, `make-part`, `merge`, `self-similar`, `across-similar`, `collect`, `count-occurrences`) that all Python orchestration scripts depend on. `Cargo.toml` declares five crate dependencies: `zstd` 0.5, `crossbeam` 0.3, `filebuffer` 0.4, `clap` 3.1.1, and `bitvec` 1. The dev (debug) profile sets `opt-level = 3` and disables overflow checks for performance.

Usage

Use this environment for all deduplication workflows. Every Python script in the repository invokes `./target/debug/dedup_dataset` as a subprocess. The Rust binary must be compiled before any pipeline step can execute. The `cargo build` step is explicitly called in `scripts/run_pipeline.sh` and is a prerequisite for `scripts/deduplicate_single_file.sh`.
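Because every script shells out to the same hardcoded path, a missing binary only shows up as a subprocess failure. The sketch below shows one way a wrapper could check for the binary first and build the `make-part` invocation as an argument list. The helper names `require_binary` and `make_part_command` are hypothetical and not part of the repository; the flags are the ones that appear in `scripts/make_suffix_array.py`.

```python
import os
import shlex

# Path used by the repository's scripts, relative to the repository root.
BINARY = "./target/debug/dedup_dataset"


def require_binary(path=BINARY):
    """Raise a descriptive error if the compiled binary is missing.

    Hypothetical helper, not part of the repository.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; run `cargo build` from the repository root first"
        )
    return path


def make_part_command(data_file, start_byte, end_byte, binary=BINARY):
    """Build the argument list for the `make-part` subcommand.

    Passing a list instead of a formatted shell string avoids quoting
    problems when the data file path contains spaces.
    """
    return [
        binary, "make-part",
        "--data-file", str(data_file),
        "--start-byte", str(int(start_byte)),
        "--end-byte", str(int(end_byte)),
    ]


if __name__ == "__main__":
    # Print the command in shell-quoted form for inspection.
    print(shlex.join(make_part_command("data/my corpus.txt", 0, 1024)))
```

A caller would typically run `require_binary()` once and then pass the list to `subprocess.run(cmd, check=True)`.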

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | A C compiler is required (`sudo apt-get install gcc`) |
| Hardware | Multi-core CPU | More cores enable faster parallel SA merge and self-similar steps |
| RAM | 16GB minimum for small datasets (<10GB) | C4-scale (300GB) requires 600GB+ RAM; the entire dataset must fit in memory |
| Disk | `data_size * pointer_width + data_size` | Suffix array `.table.bin` can be up to 8x the data file size |
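The disk formula in the table can be turned into a quick estimate. The sketch below assumes that `pointer_width` is the smallest whole number of bytes able to address every byte offset in the data file, capped at 8. The page itself only states the 8x upper bound, so treat the computed width as an approximation rather than the tool's exact behavior. The function names are hypothetical.

```python
def pointer_width_bytes(data_size):
    """Smallest whole number of bytes that can address offsets 0..data_size-1.

    Assumption: pointers are packed to the minimum byte width, capped at 8
    (the upper bound given for the `.table.bin` file).
    """
    if data_size <= 0:
        raise ValueError("data_size must be positive")
    bits = max(1, (data_size - 1).bit_length())
    return min(8, (bits + 7) // 8)


def estimated_disk_bytes(data_size):
    """Disk estimate from the table: data_size * pointer_width + data_size."""
    return data_size * pointer_width_bytes(data_size) + data_size


if __name__ == "__main__":
    for label, size in [("1 GB", 10**9), ("10 GB", 10**10), ("300 GB", 3 * 10**11)]:
        width = pointer_width_bytes(size)
        total_gb = estimated_disk_bytes(size) / 1e9
        print(f"{label}: {width} bytes per pointer, about {total_gb:.0f} GB on disk")
```

Under this assumption, the pointer width grows from 4 to 5 bytes once the data file exceeds 2^32 bytes, which is the same threshold at which 32-bit pointers stop being sufficient.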

Dependencies

System Packages

  • `gcc` (C compiler; Rust invokes it as `cc` to link the binary)
  • `curl` (for rustup installation)

Rust Toolchain

  • Rust stable (installed via `rustup`)
  • Cargo (bundled with Rust)
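The prerequisites listed above can be checked before attempting a build. The following is a minimal sketch that uses only the Python standard library. The function name `find_missing_tools` is hypothetical, and treating either `cc` or `gcc` on `PATH` as a usable C compiler is an assumption for illustration.

```python
import shutil


def find_missing_tools(required=("cargo",), compilers=("cc", "gcc")):
    """Return the build prerequisites that are not found on PATH.

    Any one entry in `compilers` satisfies the C compiler requirement.
    `curl` is left out of the default check because it is only needed
    to install rustup, not to build.
    """
    missing = [tool for tool in required if shutil.which(tool) is None]
    if not any(shutil.which(c) for c in compilers):
        missing.append("C compiler (" + " or ".join(compilers) + ")")
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("cargo and a C compiler were found on PATH.")
```

An empty result only means the tools are present on `PATH`; it does not guarantee that `cargo build` will succeed.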

Rust Crate Dependencies (Cargo.toml)

  • `zstd` = 0.5
  • `crossbeam` = 0.3
  • `filebuffer` = 0.4
  • `clap` = 3.1.1 (with `derive` feature)
  • `bitvec` = 1

Credentials

No credentials are required for the Rust build environment.

Quick Install

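# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Make cargo available in the current shell (new shells pick it up automatically)
source "$HOME/.cargo/env"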

# Install C compiler (Ubuntu/Debian)
sudo apt-get install gcc

# Build the dedup_dataset binary
cargo build

Code Evidence

Hardcoded binary path used by all Python scripts, from `scripts/make_suffix_array.py:48`:

cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(sys.argv[1], s, e)

Explicit `cargo build` in pipeline script, from `scripts/run_pipeline.sh:8`:

cargo build

Cargo.toml dependency declarations and performance-optimized debug profile from `Cargo.toml:7-16`:

[profile.dev]
opt-level = 3
overflow-checks = false  # Go FAAASSTTT!

[dependencies]
zstd = "0.5"
crossbeam = "0.3"
filebuffer = "0.4"
clap = { version = "3.1.1", features = ["derive"] }
bitvec = "1"

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `command not found: cargo` | Rust toolchain not installed | Install Rust with the `rustup` command from Quick Install, then open a new shell |
| `linker 'cc' not found` | No C compiler installed | `sudo apt-get install gcc` |
| `./target/debug/dedup_dataset: No such file or directory` | Binary not compiled | Run `cargo build` before executing any Python script |

Compatibility Notes

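  • Performance: The debug profile sets `opt-level = 3` with overflow checks disabled, making the debug binary nearly as fast as a release build. The scripts hardcode `./target/debug/dedup_dataset`, so a binary built with `cargo build --release` would not be picked up unless those paths are changed.
  • 64-bit only: The suffix array uses `u64` indices, requiring a 64-bit system. Datasets larger than 4GB exceed what 32-bit pointers can address.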
  • Memory model: The entire source dataset must fit in RAM. The suffix array is streamed from disk and does not need to fit in memory.
