Principle:Google research Deduplicate text datasets Byte Range Removal
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, Data_Processing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data transformation technique that excises specified byte ranges from a flat binary file to produce a deduplicated output, preserving all non-duplicate content in its original order.
Description
Byte range removal is the final step in the flat-file deduplication pipeline. Given an original binary file and a list of byte ranges to remove (produced by the collect step), it outputs a new file containing only the non-duplicate portions of the original. The algorithm reads the removal file in reverse order (highest offsets first), then processes the original file sequentially, copying non-removed regions and skipping removed ones.
This approach operates at the raw byte level, making it format-agnostic. It works on any flat binary file regardless of the text encoding, tokenization, or separator scheme used during serialization. The only constraint is that the removal file must contain valid, non-overlapping byte ranges in ascending order (as produced by the collect step).
Usage
Use this technique as the final step in single-file deduplication or cross-dataset deduplication when working with flat binary files. For TFDS-structured datasets that need to maintain their example-level schema, use the TFDS deduplication application technique instead.
Theoretical Basis
The algorithm is a simple linear scan with skip regions:
# Abstract algorithm (NOT real implementation)
def remove_byte_ranges(original_file, ranges, output_file):
# ranges = [(start1, end1), (start2, end2), ...] in ascending order
# Process from end to start (stack order) for efficient popping
ranges = reversed(ranges)
current_pos = 0
for start, end in ranges:
# Copy everything from current position up to the start of this range
output_file.write(original_file.read(start - current_pos))
# Skip the duplicate range
original_file.seek(end)
current_pos = end
# Copy remaining content after the last range
output_file.write(original_file.read())
Key properties:
- O(N) time complexity: The file is read exactly once in a single pass.
- O(1) memory: Only the removal list and current read position need to be in memory; data is streamed.
- Order preserving: Non-duplicate content appears in the output in the same order as the original.
- Format agnostic: Works on any binary content without needing to understand the data format.