Implementation:Google research Deduplicate text datasets Count Occurrences
| Knowledge Sources | |
|---|---|
| Domains | String_Algorithms, Data_Structures, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool from the deduplicate-text-datasets repository that counts how often a substring occurs in a dataset by running binary search over a prebuilt suffix array.
Description
The substring query system has two layers: a Python wrapper (count_occurrences.py) that handles tokenization, query encoding, and temporary file management, and a Rust CLI subcommand (count-occurrences) that performs the actual binary search on the suffix array.
The Python wrapper supports three tokenizers, selected with --tokenizer: gpt2 and t5 (both loaded via transformers) and mytik (a custom tiktoken-based tokenizer). It encodes the query string (or the contents of the query file) into the byte representation used by the dataset, writes those bytes to a temporary file (/tmp/fin), and then invokes the Rust CLI.
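The encoding step can be sketched roughly as follows. This is an illustration, not the wrapper's actual source: `encode_query`, `write_query_file`, and `toy_tokenizer` are hypothetical names, and the 2-byte little-endian width per token ID is an assumption that only holds for vocabularies below 65,536 entries.

```python
import struct
import tempfile
from typing import Callable, List, Optional


def encode_query(query: str,
                 tokenize: Optional[Callable[[str], List[int]]] = None) -> bytes:
    """Turn a query string into the bytes that will be searched for."""
    if tokenize is None:
        # Untokenized dataset: search for the raw UTF-8 bytes.
        return query.encode("utf-8")
    # Tokenized dataset. ASSUMPTION: each token ID is stored as an
    # unsigned 16-bit little-endian integer.
    token_ids = tokenize(query)
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def write_query_file(payload: bytes) -> str:
    """Write the encoded query to a fresh temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".query") as handle:
        handle.write(payload)
        return handle.name


def toy_tokenizer(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one ID per known word."""
    vocab = {"the": 1, "quick": 2, "brown": 3, "fox": 4}
    return [vocab[word] for word in text.split()]


raw_bytes = encode_query("the quick brown fox")
token_bytes = encode_query("the quick brown fox", toy_tokenizer)
query_path = write_query_file(token_bytes)
```

Using a uniquely named temporary file rather than a fixed path is a choice made for this sketch; it avoids two concurrent queries overwriting each other's input.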
The Rust subcommand loads the data file and suffix array, reads the query from the query file, and performs two binary searches to find the range of matching suffixes. It reports the count to stdout and optionally the byte location of the first match.
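The two-binary-search idea can be illustrated with a small pure-Python sketch. It is not the Rust implementation: the suffix array here is built naively in memory, and the function names are made up for the example. The first search finds the first suffix whose prefix is not less than the query, the second finds the first suffix whose prefix is greater, and the count is the distance between them.

```python
from typing import List, Tuple


def build_suffix_array(data: bytes) -> List[int]:
    """Naive construction by sorting suffixes; fine for a demo, not for real corpora."""
    return sorted(range(len(data)), key=lambda start: data[start:])


def match_range(data: bytes, suffix_array: List[int], query: bytes) -> Tuple[int, int]:
    """Return the half-open range of suffix-array ranks whose suffixes start with query."""
    width = len(query)

    def prefix(rank: int) -> bytes:
        start = suffix_array[rank]
        return data[start:start + width]

    # Search 1: first rank whose prefix is >= query (lower bound).
    low, high = 0, len(suffix_array)
    while low < high:
        mid = (low + high) // 2
        if prefix(mid) < query:
            low = mid + 1
        else:
            high = mid
    first = low

    # Search 2: first rank whose prefix is > query (upper bound).
    high = len(suffix_array)
    while low < high:
        mid = (low + high) // 2
        if prefix(mid) <= query:
            low = mid + 1
        else:
            high = mid
    return first, low


def count_occurrences(data: bytes, suffix_array: List[int], query: bytes) -> int:
    """Number of (possibly overlapping) occurrences of query in data."""
    first, last = match_range(data, suffix_array, query)
    return last - first


corpus = b"banana"
table = build_suffix_array(corpus)
print(count_occurrences(corpus, table, b"ana"))  # prints 2
```

Each search inspects O(log n) suffixes and compares at most `len(query)` bytes per step, which is why a lookup stays cheap even when the dataset is large.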
Usage
Use this tool when you need to check how many times a specific string or token sequence appears in a dataset. The dataset must already be serialized and have a suffix array built. Typical uses are verifying deduplication, checking for data contamination, and analyzing corpus composition.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File (Python wrapper): scripts/count_occurrences.py (L1-99)
- File (Rust CLI args): src/main.rs (L95-104)
- File (Rust implementation): src/main.rs (L634-672)
Signature
# Python wrapper (preferred entry point)
python3 scripts/count_occurrences.py \
--suffix <data_path> \
(--query <query_string> | --query_file <query_file_path>) \
[--tokenize] \
[--tokenizer gpt2|t5|mytik] \
[--print_location] \
[--load_disk]
# Underlying Rust subcommand (called by wrapper)
dedup_dataset count-occurrences \
--data-file <path> \
--query-file <query_file> \
[--print-location] \
[--load-disk]
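For callers that want to drive the Rust subcommand directly, the argument list can be assembled as in the sketch below. The subcommand and flag names are taken from the signature above; `build_count_command` and the binary path `./target/debug/dedup_dataset` are assumptions made for illustration (the path is where a default `cargo build` would normally place a binary of that name).

```python
from typing import List


def build_count_command(binary: str,
                        data_path: str,
                        query_file: str,
                        print_location: bool = False,
                        load_disk: bool = False) -> List[str]:
    """Assemble the argv for the count-occurrences subcommand."""
    command = [
        binary, "count-occurrences",
        "--data-file", data_path,
        "--query-file", query_file,
    ]
    if print_location:
        command.append("--print-location")
    if load_disk:
        command.append("--load-disk")
    return command


# ASSUMPTION: binary location after a default (debug) cargo build.
argv = build_count_command(
    "./target/debug/dedup_dataset",
    "/data/sa/wiki40b.test",
    "/tmp/fin",
    print_location=True,
)
# To actually run it:
#   import subprocess
#   result = subprocess.run(argv, check=True, capture_output=True, text=True)
#   print(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when a path contains spaces.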
Import
# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke via the Python wrapper:
python3 scripts/count_occurrences.py --suffix <data_path> --query "search string"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --suffix | str | Yes | Path to the data file (suffix array expected at <path>.table.bin) |
| --query | str | No | Query string (mutually exclusive with --query_file; one is required) |
| --query_file | str | No | Path to a file containing the query (mutually exclusive with --query) |
| --tokenize | flag | No | Tokenize the query before searching (for tokenized datasets) |
| --tokenizer | str | No | Tokenizer choice: "gpt2" (default), "t5", or "mytik" |
| --print_location | flag | No | Also print the byte location of the first match |
| --load_disk | flag | No | Use memory-mapped I/O instead of loading suffix array into RAM |
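The input contract in the table can be expressed with `argparse` roughly as follows. This is a sketch of the documented interface under the assumption that the mutual exclusivity is enforced by the parser; the script's real argument handling may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented inputs (not the script's actual source)."""
    parser = argparse.ArgumentParser(
        description="Count occurrences of a query in a suffix-array-indexed dataset.")
    parser.add_argument("--suffix", required=True,
                        help="Path to the data file; suffix array at <path>.table.bin")
    # Exactly one of --query / --query_file must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--query", help="Query string")
    source.add_argument("--query_file", help="File containing the query")
    parser.add_argument("--tokenize", action="store_true",
                        help="Tokenize the query before searching")
    parser.add_argument("--tokenizer", choices=["gpt2", "t5", "mytik"],
                        default="gpt2", help="Tokenizer used with --tokenize")
    parser.add_argument("--print_location", action="store_true",
                        help="Also print the byte location of the first match")
    parser.add_argument("--load_disk", action="store_true",
                        help="Avoid loading the suffix array into RAM")
    return parser


args = build_parser().parse_args(
    ["--suffix", "/data/sa/wiki40b.test", "--query", "the quick brown fox"])
```

With this layout, supplying both `--query` and `--query_file`, or neither, makes `argparse` exit with a usage error before any search is attempted.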
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | Text | Count of occurrences of the query in the dataset; optionally byte location |
Usage Examples
Count Occurrences of a String
# Count how many times "the quick brown fox" appears in the dataset
python3 scripts/count_occurrences.py \
--suffix /data/sa/wiki40b.test \
--query "the quick brown fox"
# Output: 42
Count with Location
# Count and print the byte location of the first occurrence
python3 scripts/count_occurrences.py \
--suffix /data/sa/wiki40b.test \
--query "example sentence" \
--print_location
# Output: 7
# Output: 123456 (byte position)
Tokenized Query
# Search for a tokenized sequence in a tokenized dataset
python3 scripts/count_occurrences.py \
--suffix /data/sa/wiki40b.test \
--query "the quick brown fox" \
--tokenize \
--tokenizer gpt2
Query from File
# Search for the contents of a file
python3 scripts/count_occurrences.py \
--suffix /data/sa/wiki40b.test \
--query_file /tmp/my_query.txt
Low Memory Mode
# Use disk-based suffix array access for large datasets
python3 scripts/count_occurrences.py \
--suffix /data/sa/c4.train \
--query "specific passage" \
--load_disk
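To give a feel for why disk-based access keeps memory use low, the sketch below reads individual suffix-array entries through a memory map instead of loading the whole table. The on-disk layout is an assumption made for the example (a flat array of fixed-width little-endian unsigned integers), and `open_table` and `read_entry` are hypothetical helpers, not part of the tool.

```python
import mmap
import os
import tempfile

ENTRY_WIDTH = 5  # ASSUMPTION: bytes per suffix-array entry in this demo file


def open_table(path: str) -> mmap.mmap:
    """Map the table read-only; pages are loaded only when touched."""
    with open(path, "rb") as handle:
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


def read_entry(table: mmap.mmap, rank: int, width: int = ENTRY_WIDTH) -> int:
    """Return the suffix start offset stored at the given rank."""
    offset = rank * width
    return int.from_bytes(table[offset:offset + width], "little")


# Build a tiny demo table holding the suffix array of b"banana".
demo_entries = [5, 3, 1, 0, 4, 2]
with tempfile.NamedTemporaryFile(delete=False, suffix=".table.bin") as out:
    for entry in demo_entries:
        out.write(entry.to_bytes(ENTRY_WIDTH, "little"))
    demo_path = out.name

table = open_table(demo_path)
middle = read_entry(table, 2)
entry_count = len(table) // ENTRY_WIDTH
table.close()
os.remove(demo_path)
```

A binary search touches only O(log n) entries, so only a handful of pages are ever read from disk for a single query.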