Principle:Google research Deduplicate text datasets TFDS Deduplication Application
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, Data_Processing, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data transformation technique that maps byte-level duplicate ranges back to structured dataset examples, applies substring removal within each example, and rebuilds a clean TensorFlow Dataset.
Description
TFDS deduplication application solves the problem of applying byte-level deduplication results back to a structured dataset that has an example-level schema. While the suffix array operates on a flat binary representation, the original dataset consists of discrete examples with metadata fields (text, URL, timestamp, etc.). This technique bridges the gap by:
- Using the
.sizeoffset file (from the serialization step) to map each byte range to the example it belongs to. - Computing per-example sub-ranges by adjusting byte offsets relative to each example's start position.
- Applying substring removal within each example's text field while preserving all other metadata.
- Rebuilding a valid TFDS dataset using a custom
GeneratorBasedBuilderthat produces cleaned examples. - Restructuring the output directory to match the expected TFDS layout (e.g.,
wiki40b/en/1.3.0/).
This is more complex than flat-file byte removal because the output must be a valid TFDS dataset that can be loaded with tfds.load().
Usage
Use this technique as the final step when deduplicating a TFDS dataset (specifically Wiki40B as demonstrated in the paper). For flat binary files that do not need to maintain a TFDS schema, use byte range removal instead.
Theoretical Basis
The mapping from byte ranges to per-example ranges uses the cumulative size array:
# Abstract algorithm (NOT real implementation)
def map_byte_ranges_to_examples(byte_ranges, sizes):
"""Map global byte ranges to per-example (start, end) pairs."""
example_ranges = defaultdict(list)
ptr = 0 # pointer into byte_ranges
for i, byte_start in enumerate(sizes[:-1]):
byte_end = sizes[i + 1]
while ptr < len(byte_ranges) and byte_start <= byte_ranges[ptr][0] < byte_end:
# Adjust to example-relative offsets, accounting for 6-byte separator
local_start = max(byte_ranges[ptr][0] - byte_start - 6, 0)
local_end = min(byte_ranges[ptr][1] - byte_start, byte_end - byte_start)
example_ranges[i].append((local_start, local_end))
ptr += 1
return example_ranges
def apply_dedup_to_example(text, ranges):
"""Remove duplicate substrings from a single example."""
for start, end in reversed(ranges):
text = text[:start] + text[end:]
return text
Key details:
- The 6-byte separator offset (
\xff\xff+ 4-byte UID) must be subtracted when mapping global byte positions to example-local positions. - Ranges are applied in reverse order within each example to avoid invalidating later offsets.
- The algorithm uses
multiprocessingfor parallel example processing. - The output is restructured into the standard TFDS directory layout with merged
dataset_info.json.