Implementation:Google research Deduplicate text datasets Count Occurrences

From Leeroopedia
Knowledge Sources
Domains String_Algorithms, Data_Structures, NLP
Last Updated 2026-02-14 21:00 GMT

Overview

A concrete tool from the deduplicate-text-datasets repository that counts how many times a substring occurs in a dataset by running a binary search over the dataset's suffix array.

Description

The substring query system has two layers: a Python wrapper (count_occurrences.py) that handles tokenization, query encoding, and temporary file management, and a Rust CLI subcommand (count-occurrences) that performs the actual binary search on the suffix array.

The Python wrapper supports three tokenizers (GPT-2 via transformers, T5 via transformers, and a custom tiktoken-based tokenizer). It encodes the query string (or file) into the appropriate byte representation, writes it to a temporary file (/tmp/fin), then invokes the Rust CLI.
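
The wrapper flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the helper names encode_query and build_command, the binary path ./target/debug/dedup_dataset, and the layout of two little-endian bytes per token ID are all assumptions. Only the temporary file path /tmp/fin and the CLI flags are taken from this page.

```python
import struct
from typing import List, Optional


def encode_query(text: str, token_ids: Optional[List[int]] = None) -> bytes:
    """Return the byte string that will be searched for.

    Without token IDs, the raw UTF-8 bytes of the text are used.
    With token IDs, each ID is packed as an unsigned 16-bit little-endian
    integer (an ASSUMED layout; it only works for IDs below 65536).
    """
    if token_ids is None:
        return text.encode("utf-8")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def build_command(
    data_path: str,
    query_path: str = "/tmp/fin",  # temporary query file named on this page
    print_location: bool = False,
    load_disk: bool = False,
    binary: str = "./target/debug/dedup_dataset",  # ASSUMED build output path
) -> List[str]:
    """Assemble the argument list for the count-occurrences subcommand."""
    cmd = [
        binary,
        "count-occurrences",
        "--data-file", data_path,
        "--query-file", query_path,
    ]
    if print_location:
        cmd.append("--print-location")
    if load_disk:
        cmd.append("--load-disk")
    return cmd
```

In such a sketch, the caller would write the result of encode_query to the query file in binary mode and then pass the list from build_command to subprocess.run. Passing an argument list rather than a shell string avoids quoting problems when the data path contains spaces.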

The Rust subcommand loads the data file and suffix array, reads the query from the query file, and performs two binary searches to find the range of suffixes that start with the query: one search locates the lower bound of the range and the other locates the upper bound. It prints the count to stdout and, when --print-location is set, the byte location of the first match.
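
To make the two-binary-search idea concrete, here is a small Python sketch of the technique. It is not a port of the Rust implementation: the function names are hypothetical, and the suffix array is built naively, which is only practical for tiny inputs.

```python
from typing import List, Tuple


def build_suffix_array(data: bytes) -> List[int]:
    """Naive construction by sorting all suffixes; for tiny inputs only."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def count_occurrences(data: bytes, sa: List[int], query: bytes) -> Tuple[int, int]:
    """Return (count, location); location is -1 when nothing matches.

    location is the first match in suffix-array order, which is not
    necessarily the smallest byte offset in the data.
    """
    m = len(query)

    # First search: leftmost suffix whose m-byte prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second search: leftmost suffix whose m-byte prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    end = lo

    count = end - start
    location = sa[start] if count else -1
    return count, location
```

For data = b"banana", the suffix array is [5, 3, 1, 0, 4, 2] and count_occurrences(data, sa, b"ana") returns (2, 3). Because all suffixes that share a prefix are adjacent in sorted order, the count is simply the width of the range between the two bounds, and each search costs O(log n) prefix comparisons.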

Usage

Use this tool when you need to check how many times a specific string or token sequence appears in a dataset. The dataset must already be serialized and have a suffix array built. It is useful for verifying deduplication, checking for data contamination, or analyzing corpus composition.

Code Reference

Source Location

  • Repository: deduplicate-text-datasets
  • File (Python wrapper): scripts/count_occurrences.py (L1-99)
  • File (Rust CLI args): src/main.rs (L95-104)
  • File (Rust implementation): src/main.rs (L634-672)

Signature

# Python wrapper (preferred entry point)
python3 scripts/count_occurrences.py \
    --suffix <data_path> \
    (--query <query_string> | --query_file <query_file_path>) \
    [--tokenize] \
    [--tokenizer gpt2|t5|mytik] \
    [--print_location] \
    [--load_disk]

# Underlying Rust subcommand (called by wrapper)
dedup_dataset count-occurrences \
    --data-file <path> \
    --query-file <query_file> \
    [--print-location] \
    [--load-disk]

Import

# CLI tool, not importable. Requires prior build:
cargo build
# Then invoke via the Python wrapper:
python3 scripts/count_occurrences.py --suffix <data_path> --query "search string"

I/O Contract

Inputs

Name              Type  Required  Description
--suffix          str   Yes       Path to the data file (suffix array expected at <path>.table.bin)
--query           str   No        Query string (mutually exclusive with --query_file; one is required)
--query_file      str   No        Path to a file containing the query (mutually exclusive with --query)
--tokenize        flag  No        Tokenize the query before searching (for tokenized datasets)
--tokenizer       str   No        Tokenizer choice: "gpt2" (default), "t5", or "mytik"
--print_location  flag  No        Also print the byte location of the first match
--load_disk       flag  No        Use memory-mapped I/O instead of loading suffix array into RAM
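
The constraints in the table above (exactly one of --query and --query_file, and a suffix array at <path>.table.bin) can be checked before invoking the tool. The helper below is a hypothetical pre-flight check written for illustration; the wrapper itself is not stated to perform these checks.

```python
import os
from typing import Optional


def check_inputs(
    suffix: str,
    query: Optional[str] = None,
    query_file: Optional[str] = None,
) -> str:
    """Validate the arguments and return the expected suffix array path.

    Hypothetical helper: mirrors the I/O contract on this page rather than
    any function in the repository.
    """
    # Exactly one of the two query sources must be given.
    if (query is None) == (query_file is None):
        raise ValueError("provide exactly one of --query or --query_file")

    if not os.path.isfile(suffix):
        raise FileNotFoundError(f"data file not found: {suffix}")

    # The suffix array is expected next to the data file.
    table = suffix + ".table.bin"
    if not os.path.isfile(table):
        raise FileNotFoundError(f"suffix array not found: {table}")
    return table
```

Failing early in this way gives a clearer error than letting the search run against a dataset whose suffix array has not been built yet.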

Outputs

Name    Type  Description
stdout  Text  Count of occurrences of the query in the dataset; with --print_location, also the byte location of the first match

Usage Examples

Count Occurrences of a String

# Count how many times "the quick brown fox" appears in the dataset
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query "the quick brown fox"

# Example output (illustrative value): 42

Count with Location

# Count and print the byte location of the first occurrence
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query "example sentence" \
    --print_location

# Example output (illustrative values):
# 7
# 123456 (byte position of the first match)

Tokenized Query

# Search for a tokenized sequence in a tokenized dataset
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query "the quick brown fox" \
    --tokenize \
    --tokenizer gpt2

Query from File

# Search for the contents of a file
python3 scripts/count_occurrences.py \
    --suffix /data/sa/wiki40b.test \
    --query_file /tmp/my_query.txt

Low Memory Mode

# Use disk-based suffix array access for large datasets
python3 scripts/count_occurrences.py \
    --suffix /data/sa/c4.train \
    --query "specific passage" \
    --load_disk

Related Pages

Implements Principle

Requires Environment
