Principle:Google research Deduplicate text datasets Substring Occurrence Querying
| Knowledge Sources | |
|---|---|
| Domains | String_Algorithms, Data_Structures, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A query technique that uses a precomputed suffix array to count (and optionally locate) all occurrences of a given substring in a large text dataset in O(M log N) time.
Description
Substring occurrence querying leverages the sorted property of suffix arrays to perform fast substring search via binary search. Given a query string of length M and a dataset of size N with a precomputed suffix array, the algorithm finds all positions where the query appears by performing two binary searches: one to find the leftmost matching suffix and one to find the rightmost. The count of occurrences is simply the difference between these two positions.
This technique enables ad-hoc analysis of large corpora without re-scanning the entire dataset. For example, researchers can quickly check how many times a specific sentence, phrase, or token sequence appears in a training set. With the optional --print-location flag, the exact byte position of each occurrence can also be reported.
The query can operate on raw text (UTF-8 encoded) or tokenized sequences (uint16 token IDs), enabling both character-level and token-level search.
Usage
Use this technique when you need to count or locate occurrences of a specific substring in a large dataset that already has a suffix array built. It is useful for data analysis, memorization detection, and verifying deduplication results. Do not use for bulk deduplication (use self-similar or across-similar instead).
Theoretical Basis
Binary search on a suffix array:
# Abstract algorithm (NOT real implementation)
def count_occurrences(data, suffix_array, query):
"""Count occurrences of query in data using binary search on suffix array."""
n = len(suffix_array)
# Find leftmost position where query could appear
lo, hi = 0, n
while lo < hi:
mid = (lo + hi) // 2
suffix = data[suffix_array[mid]:]
if suffix[:len(query)] < query:
lo = mid + 1
else:
hi = mid
left = lo
# Find rightmost position where query could appear
lo, hi = left, n
while lo < hi:
mid = (lo + hi) // 2
suffix = data[suffix_array[mid]:]
if suffix[:len(query)] <= query:
lo = mid + 1
else:
hi = mid
right = lo
count = right - left
positions = [suffix_array[i] for i in range(left, right)]
return count, positions
Time complexity: O(M log N) where M is the query length and N is the dataset size. Each binary search step requires an O(M) string comparison, and there are O(log N) steps.
Memory modes:
- RAM mode (default): Load the entire suffix array into memory for fastest queries.
- Disk mode (
--load-disk): Stream the suffix array from disk using memory-mapped I/O, trading speed for lower memory usage.