# Heuristic: Variable-Width Pointer Optimization in Google Research's `deduplicate-text-datasets`
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
## Overview
Disk space optimization using variable-width suffix array pointers (`ceil(log2(data_size)/8)` bytes per entry) instead of fixed 8-byte (u64) pointers.
## Description
Suffix arrays store one pointer per byte of the original dataset. A naive implementation would use 8 bytes (u64) per pointer, resulting in a suffix array 8x the size of the input. This heuristic computes the minimum pointer width needed to address all bytes in the dataset: `ceil(log2(data_size) / 8)`. For a 100MB file, only 4 bytes are needed per pointer (32-bit), halving the disk usage compared to the naive approach. For C4-scale data (~300GB), 5 bytes (40-bit) suffice. The Rust code always uses u64 in memory but serializes only the required bytes to disk.
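The rule above can be sketched as a small function (the name `pointer_width` is illustrative, not the repo's API):

```rust
// Sketch of the pointer-width rule described above.
// `pointer_width` is an illustrative name, not the repo's API.
fn pointer_width(data_size: u64) -> usize {
    // ceil(log2(data_size) / 8): smallest number of bytes whose address
    // space covers every byte offset in the dataset.
    ((data_size as f64).log2() / 8.0).ceil() as usize
}

fn main() {
    assert_eq!(pointer_width(100_000_000), 4);     // 100 MB file -> 4-byte pointers
    assert_eq!(pointer_width(300_000_000_000), 5); // ~300 GB (C4 scale) -> 5-byte pointers
    println!("ok");
}
```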
## Usage
This optimization is applied automatically during suffix array construction (`make`, `make-part`). It affects the `.table.bin` file size on disk and the format of intermediate `dups` files. Downstream consumers (e.g., `self-similar`, `across-similar`, `count-occurrences`) detect the pointer width by computing `size_table / size_text`. No user action is required, but understanding this is critical when debugging suffix array files or interpreting the binary format.
## The Insight (Rule of Thumb)
- Action: Suffix array pointer width is computed as `ceil(log2(data_size) / 8)` bytes.
- Value: 3 bytes for files up to 16MB, 4 bytes up to 4GB, 5 bytes up to 1TB, 6 bytes up to 256TB.
- Trade-off: Saves 37.5-62.5% of suffix array disk space at realistic dataset sizes (3-byte pointers save 62.5% vs. u64, 4-byte save 50%, 5-byte save 37.5%). No runtime cost since the in-memory representation always uses u64; a slight cost is paid for packing/unpacking during I/O.
- Validation: The code asserts that `size_table % size_text == 0` to verify the pointer width is consistent.
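The validation invariant can be exercised with the repo's `to_bytes` serializer (reproduced from `src/main.rs`); the harness around it is a sketch:

```rust
// `to_bytes` is reproduced from the repo's src/main.rs; the harness around
// it is a sketch of the width-detection invariant described above.
pub fn to_bytes(input: &[u64], size_width: usize) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(size_width * input.len());
    for value in input {
        bytes.extend(&value.to_le_bytes()[..size_width]);
    }
    bytes
}

fn main() {
    // One suffix pointer per byte of text, serialized at 4 bytes each.
    let text_len = 6;
    let table: Vec<u64> = vec![0, 3, 1, 4, 2, 5];
    let size_width = 4;
    let bytes = to_bytes(&table, size_width);
    // Consumer-side checks: the table size divides evenly by the text size,
    // and the quotient recovers the pointer width.
    assert!(bytes.len() % text_len == 0);
    assert_eq!(bytes.len() / text_len, size_width);
    println!("detected width: {}", bytes.len() / text_len);
}
```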
## Reasoning
For large-scale text deduplication, the suffix array table is the largest file artifact. A 300GB dataset (C4) would produce a 2.4TB suffix array at 8 bytes per pointer, but only ~1.5TB at 5 bytes per pointer. This is a significant difference when working with limited disk space. The variable-width encoding exploits the fact that most datasets do not need the full 64-bit address space.
The code also uses this optimization for the intermediate `dups` files produced by `self-similar` and `across-similar`, which store pointers to duplicate locations. The README explains: "All pointers are the same size, but the size of the pointers depends on the size of the dataset. We use the smallest pointer size that could address the entire dataset."
## Code Evidence
Pointer width calculation in Rust from `src/main.rs:571`:

```rust
let ratio = ((text.len() as f64).log2()/8.0).ceil() as usize;
```
Same calculation in Python for validation from `scripts/make_suffix_array.py:72`:

```python
FACT = np.ceil(np.log(size_data)/np.log(2)/8)
```
Width detection at load time from `src/main.rs:857-859`:

```rust
assert!(metadata.len() % (text.len() as u64) == 0);
let ratio = metadata.len()/(text.len() as u64);
```
Serialization using only the required bytes from `src/main.rs:175-181`:

```rust
pub fn to_bytes(input: &[u64], size_width: usize) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(size_width * input.len());
    for value in input {
        bytes.extend(&value.to_le_bytes()[..size_width]);
    }
    bytes
}
```
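A plausible inverse of `to_bytes` (the repo's loader differs in detail; this decode sketch only shows how truncated little-endian pointers widen back to `u64`):

```rust
// Sketch of decoding variable-width little-endian pointers back to u64.
// `from_bytes` is an illustrative helper, not the repo's exact loader.
fn from_bytes(bytes: &[u8], size_width: usize) -> Vec<u64> {
    assert!(bytes.len() % size_width == 0);
    bytes
        .chunks_exact(size_width)
        .map(|chunk| {
            // Copy the low `size_width` bytes; the high bytes stay zero.
            let mut buf = [0u8; 8];
            buf[..size_width].copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect()
}

fn main() {
    // Round-trip through a 5-byte (40-bit) encoding, the C4-scale width.
    let original: Vec<u64> = vec![0, 1, 0xFF_FFFF_FFFF, 42];
    let size_width = 5;
    let bytes: Vec<u8> = original
        .iter()
        .flat_map(|v| v.to_le_bytes()[..size_width].to_vec())
        .collect();
    assert_eq!(from_bytes(&bytes, size_width), original);
    println!("round-trip ok");
}
```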