Implementation:Google research Deduplicate text datasets Count Occurrences

Knowledge Sources	deduplicate-text-datasets; Deduplicating Training Data Makes Language Models Better
Domains	String_Algorithms, Data_Structures, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the deduplicate-text-datasets repository, for counting substring occurrences in a dataset by binary search over a suffix array.

Description

The substring query system has two layers: a Python wrapper (count_occurrences.py) that handles tokenization, query encoding, and temporary file management, and a Rust CLI subcommand (count-occurrences) that performs the actual binary search on the suffix array.

The Python wrapper supports three tokenizers: GPT-2 via transformers ("gpt2"), T5 via transformers ("t5"), and a custom tiktoken-based tokenizer ("mytik"). It encodes the query string (or the contents of the query file) into the appropriate byte representation, writes the result to a temporary file (/tmp/fin), and then invokes the Rust CLI on that file.
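The encoding step can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function names are hypothetical, the token IDs are made up, and the 2-byte little-endian packing of token IDs is an assumption about how a tokenized dataset is serialized. Check it against the serialization your dataset really uses.

```python
import os
import struct
import tempfile


def encode_query(query, token_ids=None):
    """Return the bytes to search for.

    Without token IDs the query is searched as raw UTF-8 bytes. With token
    IDs, each ID is packed as a 2-byte little-endian unsigned integer
    (an ASSUMED layout for tokenized datasets).
    """
    if token_ids is None:
        return query.encode("utf-8")
    return struct.pack("<%dH" % len(token_ids), *token_ids)


def write_query_file(payload):
    """Write the payload to a fresh temporary file and return its path."""
    fd, path = tempfile.mkstemp(prefix="query_", suffix=".bin")
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    return path


if __name__ == "__main__":
    raw = encode_query("the quick brown fox")
    packed = encode_query("", token_ids=[262, 2068])  # hypothetical token IDs
    path = write_query_file(packed)
    print(len(raw), len(packed), path)
    os.remove(path)
```

The sketch uses a uniquely named temporary file rather than a fixed path such as /tmp/fin, which avoids two concurrent queries overwriting each other's query file.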

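The Rust subcommand loads the data file and suffix array, reads the query from the query file, and performs two binary searches to find the range of matching suffixes. Because the suffix array is sorted, every suffix that begins with the query sits in one contiguous range, so the two searches locate the start and end of that range and the count is its width. The subcommand prints the count to stdout and, when --print-location is given, the byte location of the first match.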
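The two-search idea can be sketched in Python on a toy input. This is an illustration of the technique only, not a port of the Rust code: the suffix array is built naively, all function names are hypothetical, and the reported location is simply the first match in suffix-array order.

```python
def build_suffix_array(data):
    """Naive construction by sorting suffixes; suitable for toy inputs only."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def _first_index(data, sa, query, strict):
    """First suffix-array index whose suffix, truncated to len(query),
    is >= query (or > query when strict is True)."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = data[sa[mid]:sa[mid] + len(query)]
        if prefix < query or (strict and prefix == query):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(data, sa, query):
    """Return (count, byte offset of one match or None)."""
    start = _first_index(data, sa, query, strict=False)  # lower bound
    end = _first_index(data, sa, query, strict=True)     # upper bound
    location = sa[start] if end > start else None
    return end - start, location


if __name__ == "__main__":
    text = b"banana"
    sa = build_suffix_array(text)
    print(count_occurrences(text, sa, b"ana"))
```

Each search costs O(m log n) byte comparisons for a query of length m over n suffixes, which is why the count is cheap once the suffix array exists.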

Usage

Use this tool when you need to check how many times a specific string or token sequence appears in a dataset. The dataset must already be serialized and have a suffix array built. The tool is useful for verifying deduplication, checking for data contamination, and analyzing corpus composition.
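Before running a query, it can help to confirm that both prerequisites are present. The check below is a small sketch under the assumption, taken from the input table on this page, that the suffix array lives next to the data file at <path>.table.bin. The function name is hypothetical.

```python
import os


def missing_inputs(data_path):
    """Return the required files that do not exist yet.

    Assumes the suffix array is stored at `data_path + ".table.bin"`.
    """
    required = [data_path, data_path + ".table.bin"]
    return [path for path in required if not os.path.isfile(path)]


if __name__ == "__main__":
    missing = missing_inputs("/data/sa/wiki40b.test")
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("Data file and suffix array found.")
```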

Code Reference

Source Location

Repository: deduplicate-text-datasets
File (Python wrapper): scripts/count_occurrences.py (L1-99)
File (Rust CLI args): src/main.rs (L95-104)
File (Rust implementation): src/main.rs (L634-672)

Signature

# Python wrapper (preferred entry point)
python3 scripts/count_occurrences.py \
    --suffix <data_path> \
    (--query <query_string> | --query_file <query_file_path>) \
    [--tokenize] \
    [--tokenizer gpt2|t5|mytik] \
    [--print_location] \
    [--load_disk]

# Underlying Rust subcommand (called by wrapper)
dedup_dataset count-occurrences \
    --data-file <path> \
    --query-file <query_file> \
    [--print-location] \
    [--load-disk]

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke via the Python wrapper:
python3 scripts/count_occurrences.py --suffix <data_path> --query "search string"
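Although the tool is not importable, it can be driven from Python through a subprocess. The sketch below only uses flags listed on this page. The helper names are hypothetical, and running from the repository root (so that the wrapper can find the built binary) is an assumption.

```python
import subprocess
import sys


def build_command(data_path, query, tokenize=False, tokenizer="gpt2"):
    """Assemble the argument list for the Python wrapper."""
    cmd = [
        sys.executable, "scripts/count_occurrences.py",
        "--suffix", data_path,
        "--query", query,
    ]
    if tokenize:
        cmd += ["--tokenize", "--tokenizer", tokenizer]
    return cmd


def run_count(data_path, query, repo_root=".", **options):
    """Run the wrapper and return whatever it prints to stdout.

    Assumes `repo_root` is the repository checkout and that
    `cargo build` has already been run there.
    """
    completed = subprocess.run(
        build_command(data_path, query, **options),
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    print(build_command("/data/sa/wiki40b.test", "the quick brown fox"))
```

Passing the arguments as a list avoids shell quoting problems when the query contains spaces or quotes.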

I/O Contract

Inputs

Name	Type	Required	Description
--suffix	str	Yes	Path to the data file (suffix array expected at <path>.table.bin)
--query	str	No	Query string (mutually exclusive with --query_file; one is required)
--query_file	str	No	Path to a file containing the query (mutually exclusive with --query)
--tokenize	flag	No	Tokenize the query before searching (for tokenized datasets)
--tokenizer	str	No	Tokenizer choice: "gpt2" (default), "t5", or "mytik"
--print_location	flag	No	Also print the byte location of the first match
--load_disk	flag	No	Use memory-mapped I/O instead of loading suffix array into RAM
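One way to express this input contract is with argparse, as in the sketch below. It mirrors the table above but is not the repository's actual parser. In particular, enforcing the --query / --query_file choice through a mutually exclusive group is an illustrative design choice.

```python
import argparse


def make_parser():
    """Build a parser that mirrors the documented inputs (illustrative only)."""
    parser = argparse.ArgumentParser(description="Count occurrences of a query.")
    parser.add_argument("--suffix", required=True,
                        help="path to the data file")
    # Exactly one of --query / --query_file must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--query", help="query string")
    source.add_argument("--query_file", help="file containing the query")
    parser.add_argument("--tokenize", action="store_true")
    parser.add_argument("--tokenizer", default="gpt2",
                        choices=["gpt2", "t5", "mytik"])
    parser.add_argument("--print_location", action="store_true")
    parser.add_argument("--load_disk", action="store_true")
    return parser


if __name__ == "__main__":
    args = make_parser().parse_args(
        ["--suffix", "/data/sa/wiki40b.test", "--query", "example sentence"]
    )
    print(args.tokenizer, args.tokenize, args.query_file)
```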

Outputs

Name	Type	Description
stdout	Text	Count of occurrences of the query in the dataset; with --print_location, also the byte location of the first match

Usage Examples

Count Occurrences of a String

# Count how many times "the quick brown fox" appears in the dataset
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query "the quick brown fox"

# Example output (the count is illustrative): 42

Count with Location

# Count and print the byte location of the first occurrence
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query "example sentence" \
    --print_location

# Example output (values are illustrative):
# 7
# 123456 (byte position of the first match)

Tokenized Query

# Search for a tokenized sequence in a tokenized dataset
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query "the quick brown fox" \
    --tokenize \
    --tokenizer gpt2

Query from File

# Search for the contents of a file
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query_file /tmp/my_query.txt

Low Memory Mode

# Use disk-based suffix array access for large datasets
python3 scripts/count_occurrences.py \
    --suffix /data/sa/c4.train \
    --query "specific passage" \
    --load_disk
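The trade-off behind disk-based access can be illustrated with Python's mmap module: only the pages that a lookup touches are read from disk, instead of the whole file being loaded first. This is a sketch of the general idea, not of the Rust implementation. The helper names are hypothetical, and the fixed 8-byte little-endian entry width is an assumption made for illustration, not a description of the real .table.bin layout.

```python
import mmap
import os
import tempfile

ENTRY_WIDTH = 8  # ASSUMED entry width in bytes; illustrative only


def write_entries(path, values, width=ENTRY_WIDTH):
    """Write integers as fixed-width little-endian entries."""
    with open(path, "wb") as handle:
        for value in values:
            handle.write(value.to_bytes(width, "little"))


def read_entry(path, index, width=ENTRY_WIDTH):
    """Read a single entry through a read-only memory map."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            chunk = view[index * width:(index + 1) * width]
    return int.from_bytes(chunk, "little")


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    write_entries(path, [5, 3, 1, 0, 4, 2])
    print(read_entry(path, 1))
    os.remove(path)
```

A binary search touches only about log2(n) entries, so mapping the file keeps memory use small at the cost of a disk read per probe.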

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Substring_Occurrence_Querying

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Rust_Cargo_Build_Environment
