Environment:Google research Deduplicate text datasets Rust Cargo Build Environment

Knowledge Sources	deduplicate-text-datasets Rust Install
Domains	Infrastructure, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

Linux environment with the Rust toolchain (Cargo) and a C compiler, used to build the `dedup_dataset` binary that all deduplication commands depend on.

Description

The core deduplication engine is written in Rust and must be compiled from source using `cargo build`. The resulting binary at `./target/debug/dedup_dataset` provides the CLI subcommands (`make`, `make-part`, `merge`, `self-similar`, `across-similar`, `collect`, `count-occurrences`) that all Python orchestration scripts depend on. `Cargo.toml` declares the following crate dependencies: `zstd` 0.5, `crossbeam` 0.3, `filebuffer` 0.4, `clap` 3.1.1, and `bitvec` 1. The debug (`dev`) profile sets `opt-level = 3` and disables overflow checks for performance.

Usage

Use this environment for all deduplication workflows. Every Python script in the repository invokes `./target/debug/dedup_dataset` as a subprocess. The Rust binary must be compiled before any pipeline step can execute. The `cargo build` step is explicitly called in `scripts/run_pipeline.sh` and is a prerequisite for `scripts/deduplicate_single_file.sh`.
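Because every script assumes the binary already exists, a wrapper can check for it and build on demand. The following is a minimal sketch, not code from the repository: `ensure_binary` and its `build` hook are hypothetical names, and only the binary path and the `cargo build` command come from the text above.

```python
import subprocess
from pathlib import Path

# Path used by the repository's Python scripts.
BINARY = Path("./target/debug/dedup_dataset")


def ensure_binary(binary=BINARY, build=None):
    """Return the binary path, running a build step first if it is missing.

    `build` is a hypothetical hook for illustration; when it is None the
    function runs `cargo build` in the current working directory.
    """
    binary = Path(binary)
    if not binary.is_file():
        if build is None:
            subprocess.run(["cargo", "build"], check=True)
        else:
            build()
    if not binary.is_file():
        raise FileNotFoundError(f"{binary} is still missing after the build step")
    return binary
```

A script could call `ensure_binary()` once at startup, from the repository root, before launching any `dedup_dataset` subprocess.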

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	A C compiler is required (`sudo apt-get install gcc`)
Hardware	Multi-core CPU	More cores enable faster parallel SA merge and self-similar steps
RAM	16GB minimum for small datasets (<10GB)	C4-scale (300GB) requires 600GB+ RAM; the entire dataset must fit in memory
Disk	`data_size * pointer_width + data_size`	`pointer_width` is the number of bytes per suffix array entry; the suffix array `.table.bin` can be up to 8x the data file size
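The disk requirement in the table can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `estimate_disk_bytes` is a hypothetical helper, and the default `pointer_width=8` is the worst case quoted above rather than the width the tool will actually choose for a given dataset.

```python
def estimate_disk_bytes(data_size, pointer_width=8):
    """Estimate disk usage in bytes: suffix array table plus the data file.

    pointer_width is the number of bytes per suffix array entry. The
    default of 8 is the upper bound from the requirements table; the
    real width for a particular dataset is an assumption left to the caller.
    """
    if data_size < 0:
        raise ValueError("data_size must be non-negative")
    if not 1 <= pointer_width <= 8:
        raise ValueError("pointer_width must be between 1 and 8 bytes")
    return data_size * pointer_width + data_size
```

For example, a 10 GB data file with 8-byte entries needs about 90 GB of disk: 80 GB for the table plus 10 GB for the data itself.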

Dependencies

System Packages

`gcc` (C compiler; `rustc` invokes it as `cc` to link the binary)
`curl` (for rustup installation)

Rust Toolchain

Rust stable (installed via `rustup`)
Cargo (bundled with Rust)

Rust Crate Dependencies (Cargo.toml)

`zstd` = 0.5
`crossbeam` = 0.3
`filebuffer` = 0.4
`clap` = 3.1.1 (with `derive` feature)
`bitvec` = 1

Credentials

No credentials are required for the Rust build environment.

Quick Install

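# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Make cargo available in the current shell (new shells pick it up automatically)
source "$HOME/.cargo/env"

# Install C compiler (Ubuntu/Debian)
sudo apt-get install gcc

# Build the dedup_dataset binary (run from the repository root)
cargo build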

Code Evidence

Hardcoded binary path used by all Python scripts, from `scripts/make_suffix_array.py:48`:

cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(sys.argv[1], s, e)
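The command above is assembled with string formatting, which breaks if the data file path contains spaces or shell metacharacters. As an illustrative alternative (not what the repository does), the same invocation can be built as an argument list and run without a shell. `make_part_cmd` and `run_make_part` are hypothetical names; the subcommand and flags are the ones shown in the quoted line.

```python
import subprocess


def make_part_cmd(data_file, start_byte, end_byte,
                  binary="./target/debug/dedup_dataset"):
    """Build the make-part invocation as an argument list (no shell quoting)."""
    return [
        binary, "make-part",
        "--data-file", str(data_file),
        "--start-byte", str(start_byte),
        "--end-byte", str(end_byte),
    ]


def run_make_part(data_file, start_byte, end_byte):
    """Run make-part and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(make_part_cmd(data_file, start_byte, end_byte), check=True)
```

Passing a list to `subprocess.run` hands each argument to the binary verbatim, so a path such as `data/my file.txt` needs no extra quoting.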

Explicit `cargo build` in pipeline script, from `scripts/run_pipeline.sh:8`:

cargo build

Cargo.toml dependency declarations and performance-optimized debug profile from `Cargo.toml:7-16`:

[profile.dev]
opt-level = 3
overflow-checks = false  # Go FAAASSTTT!

[dependencies]
zstd = "0.5"
crossbeam = "0.3"
filebuffer = "0.4"
clap = { version = "3.1.1", features = ["derive"] }
bitvec = "1"

Common Errors

Error Message	Cause	Solution
`command not found: cargo`	Rust toolchain not installed	Install via rustup: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
`linker 'cc' not found`	No C compiler installed	`sudo apt-get install gcc`
`./target/debug/dedup_dataset: No such file or directory`	Binary not compiled	Run `cargo build` before executing any Python script
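The first two errors in the table can be caught before the build starts by checking that the tools are on `PATH`. This is a minimal sketch: `missing_tools` is a hypothetical helper, and it assumes the linker driver is reachable under the name `cc`, as the error message above suggests.

```python
import shutil


def missing_tools(which=shutil.which):
    """Return {tool: hint} for each required build tool not found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment; it defaults to shutil.which.
    """
    hints = {
        "cargo": "install the Rust toolchain with rustup",
        "cc": "install a C compiler, e.g. `sudo apt-get install gcc`",
    }
    return {tool: hint for tool, hint in hints.items() if which(tool) is None}
```

An empty result means both tools were found; otherwise each key names a missing tool and its value gives the fix from the table.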

Compatibility Notes

Performance: The debug profile sets `opt-level = 3` with overflow checks disabled, making the debug binary nearly as fast as a release build.
64-bit only: The suffix array uses `u64` indices, requiring a 64-bit system. Datasets larger than 4GB cannot use 32-bit pointers, because a 32-bit offset can address at most 2^32 bytes (4 GiB).
Memory model: The entire source dataset must fit in RAM. The suffix array is streamed from disk and does not need to fit in memory.
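The 4GB limit in the notes above follows from simple arithmetic on offset widths. The sketch below computes the smallest whole number of bytes that can address every byte offset in a dataset. It illustrates the arithmetic only: `min_pointer_bytes` is a hypothetical helper, and whether the tool selects its pointer width this way is an assumption.

```python
def min_pointer_bytes(data_size):
    """Smallest whole number of bytes able to hold offsets 0 .. data_size - 1."""
    if data_size < 1:
        raise ValueError("data_size must be at least 1 byte")
    bits = (data_size - 1).bit_length()  # bits needed for the largest offset
    return max(1, (bits + 7) // 8)       # round up to whole bytes
```

A dataset of exactly 2^32 bytes still fits in 4-byte offsets, while one byte more requires 5, which is why 32-bit pointers stop working past 4 GiB.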
