Environment:Google research Deduplicate text datasets Rust Cargo Build Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A Linux environment with the Rust toolchain (Cargo) and a C compiler, used to build the `dedup_dataset` binary that all deduplication commands depend on.
Description
The core deduplication engine is written in Rust and must be compiled from source using `cargo build`. The resulting binary at `./target/debug/dedup_dataset` provides the CLI subcommands (`make`, `make-part`, `merge`, `self-similar`, `across-similar`, `collect`, `count-occurrences`) that all Python orchestration scripts depend on. `Cargo.toml` declares the following Rust crate dependencies: `zstd` 0.5, `crossbeam` 0.3, `filebuffer` 0.4, `clap` 3.1.1, and `bitvec` 1. Although the binary is built under the debug (`dev`) profile, that profile sets `opt-level = 3` and disables overflow checks for performance.
Usage
Use this environment for all deduplication workflows. Every Python script in the repository invokes `./target/debug/dedup_dataset` as a subprocess. The Rust binary must be compiled before any pipeline step can execute. The `cargo build` step is explicitly called in `scripts/run_pipeline.sh` and is a prerequisite for `scripts/deduplicate_single_file.sh`.
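As a rough sketch of the prerequisite described above, a wrapper could check for the binary and fall back to `cargo build` before launching any pipeline step. The helper name `ensure_binary` and its `repo_root` and `build_cmd` parameters are hypothetical and not part of the repository; only the binary path and the `cargo build` command come from this page.

```python
import subprocess
from pathlib import Path

# Relative path taken from this page; everything else here is illustrative.
BINARY = Path("target/debug/dedup_dataset")


def ensure_binary(repo_root=".", build_cmd=("cargo", "build")):
    """Return the binary path, running the build command first if it is missing.

    Hypothetical helper: the repository's own scripts simply assume the
    binary already exists.
    """
    binary = Path(repo_root) / BINARY
    if not binary.is_file():
        # check=True raises CalledProcessError if the build fails.
        subprocess.run(list(build_cmd), cwd=repo_root, check=True)
    if not binary.is_file():
        raise FileNotFoundError(
            f"{binary} was not produced by {' '.join(build_cmd)}"
        )
    return binary
```

Calling `ensure_binary()` from the repository root before any Python script would surface a missing build early instead of failing inside a subprocess call.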
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | A C compiler is required (`sudo apt-get install gcc`) |
| Hardware | Multi-core CPU | More cores speed up the parallel suffix array (SA) merge and `self-similar` steps |
| RAM | 16GB minimum for small datasets (<10GB) | C4-scale (300GB) requires 600GB+ RAM; the entire dataset must fit in memory |
| Disk | `data_size * pointer_width + data_size` | `pointer_width` is the number of bytes per suffix array entry; the suffix array `.table.bin` can be up to 8x the data file size |
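
To make the disk formula in the table concrete, the following sketch estimates the footprint of a data file plus its suffix array. It assumes that `pointer_width` is the smallest number of bytes able to address every offset in the data, capped at 8. That rule is an assumption consistent with the 8x upper bound and the 4GB remark on this page, not something the page confirms. The function names are hypothetical.

```python
def pointer_width(data_size):
    """Bytes needed to store any offset in [0, data_size], capped at 8 (assumed rule)."""
    if data_size <= 0:
        raise ValueError("data_size must be positive")
    bits = max(1, (data_size - 1).bit_length())
    return min(8, (bits + 7) // 8)


def estimated_disk_bytes(data_size):
    """Disk estimate from the table: data_size * pointer_width + data_size."""
    return data_size * pointer_width(data_size) + data_size


GB = 10**9
print(pointer_width(3 * GB))          # 4
print(pointer_width(300 * GB))        # 5
print(estimated_disk_bytes(10 * GB))  # 60000000000
```

Under this assumption a 10GB file needs 5-byte entries, so the data and its suffix array together occupy roughly 60GB.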
Dependencies
System Packages
- `gcc` (C compiler; `rustc` invokes it as `cc` to link the binary)
- `curl` (for rustup installation)
Rust Toolchain
- Rust stable (installed via `rustup`)
- Cargo (bundled with Rust)
Rust Crate Dependencies (Cargo.toml)
- `zstd` = 0.5
- `crossbeam` = 0.3
- `filebuffer` = 0.4
- `clap` = 3.1.1 (with `derive` feature)
- `bitvec` = 1
Credentials
No credentials are required for the Rust build environment.
Quick Install
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install C compiler (Ubuntu/Debian)
sudo apt-get install gcc
# Build the dedup_dataset binary
cargo build
Code Evidence
Hardcoded binary path used by all Python scripts, from `scripts/make_suffix_array.py:48`:
cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(sys.argv[1], s, e)
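
The line above formats one `make-part` command for a single byte range. As an illustration of how such ranges might be produced and passed on safely, the sketch below splits a file into contiguous ranges and builds each command as an argument list. The helper names `byte_ranges` and `make_part_cmd` and the file name `data/my_dataset.train` are hypothetical, and the non-overlapping split is a simplifying assumption; the repository script may choose its boundaries differently.

```python
BINARY = "./target/debug/dedup_dataset"  # path quoted on this page


def byte_ranges(total_bytes, jobs):
    """Split [0, total_bytes) into at most `jobs` contiguous (start, end) pairs.

    Simplifying assumption: ranges do not overlap.
    """
    if total_bytes <= 0 or jobs <= 0:
        raise ValueError("total_bytes and jobs must be positive")
    step = -(-total_bytes // jobs)  # ceiling division
    return [(s, min(s + step, total_bytes)) for s in range(0, total_bytes, step)]


def make_part_cmd(data_file, start, end):
    """Build the argv list for one `make-part` call using the flags shown above."""
    return [BINARY, "make-part", "--data-file", data_file,
            "--start-byte", str(start), "--end-byte", str(end)]


for start, end in byte_ranges(1000, 4):
    print(" ".join(make_part_cmd("data/my_dataset.train", start, end)))
```

Passing the list form to `subprocess.run` avoids shell interpretation of the file name, which the `%`-formatted string does not.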
Explicit `cargo build` in pipeline script, from `scripts/run_pipeline.sh:8`:
cargo build
Cargo.toml dependency declarations and performance-optimized debug profile from `Cargo.toml:7-16`:
[profile.dev]
opt-level = 3
overflow-checks = false # Go FAAASSTTT!
[dependencies]
zstd = "0.5"
crossbeam = "0.3"
filebuffer = "0.4"
clap = { version = "3.1.1", features = ["derive"] }
bitvec = "1"
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `command not found: cargo` | Rust toolchain not installed | Install it with rustup: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \| sh` |
| `linker 'cc' not found` | No C compiler installed | `sudo apt-get install gcc` |
| `./target/debug/dedup_dataset: No such file or directory` | Binary not compiled | Run `cargo build` before executing any Python script |
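
The checks in the table can be automated with a small preflight script. This is a sketch only: `diagnose` and its parameters are hypothetical names, and it merely tests whether `cargo`, `cc`, and the compiled binary can be found.

```python
import os
import shutil


def diagnose(binary="./target/debug/dedup_dataset", which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if which("cargo") is None:
        problems.append("cargo not found: install the Rust toolchain with rustup")
    if which("cc") is None:
        problems.append("C compiler 'cc' not found: sudo apt-get install gcc")
    if not os.path.isfile(binary):
        problems.append(f"{binary} missing: run cargo build")
    return problems


if __name__ == "__main__":
    for line in diagnose() or ["environment looks ready"]:
        print(line)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the defaults are sufficient.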
Compatibility Notes
- Performance: The debug profile sets `opt-level = 3` with overflow checks disabled, making the debug binary nearly as fast as a release build.
- 64-bit only: The suffix array uses `u64` indices, requiring a 64-bit system. Datasets larger than 4GB cannot use 32-bit pointers, because byte offsets above 2^32 do not fit in 32 bits.
- Memory model: The entire source dataset must fit in RAM. The suffix array is streamed from disk and does not need to fit in memory.
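
A quick way to apply the memory rule above is to compare the data file size against physical RAM before starting a run. The sketch below assumes Linux (it reads `SC_PHYS_PAGES` and `SC_PAGE_SIZE` through `os.sysconf`). The 2x headroom factor is an assumption drawn from the 300GB to 600GB figure in the requirements table, and `fits_in_ram` is a hypothetical helper.

```python
import os


def physical_ram_bytes():
    """Total physical memory on Linux, from sysconf."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def fits_in_ram(data_size, ram_bytes, headroom=2.0):
    """True if data_size * headroom fits within ram_bytes (headroom is an assumed factor)."""
    return data_size * headroom <= ram_bytes


GB = 10**9
print(fits_in_ram(300 * GB, 600 * GB))  # True
print(fits_in_ram(300 * GB, 256 * GB))  # False
```

In practice this could be called as `fits_in_ram(os.path.getsize(path), physical_ram_bytes())` for a given data file `path`.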
Related Pages
- Implementation:Google_research_Deduplicate_text_datasets_Make_Suffix_Array
- Implementation:Google_research_Deduplicate_text_datasets_Cmd_Self_Similar
- Implementation:Google_research_Deduplicate_text_datasets_Cmd_Collect
- Implementation:Google_research_Deduplicate_text_datasets_Cmd_Across_Similar
- Implementation:Google_research_Deduplicate_text_datasets_Count_Occurrences