Environment:Google research Deduplicate text datasets Python TFDS Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, NLP, Text_Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

Python 3 environment with TensorFlow 2.9, TensorFlow Datasets 4.9.3, and HuggingFace Transformers for TFDS-based dataset loading and deduplication workflows.

Description

This environment provides the Python runtime needed for TFDS dataset serialization (`load_dataset.py`), TFDS deduplication finalization (`finish_dedup_wiki40b.py`), and suffix array orchestration (`make_suffix_array.py`). It includes pinned TensorFlow dependencies from `requirements-tf.txt` and the `transformers` library for GPT-2/T5 tokenization. The `numpy` and `scipy` packages are also required for numeric operations.

Usage

Use this environment for any workflow that involves TensorFlow Datasets (TFDS): serializing Wiki40B or other TFDS datasets to a flat binary format, applying deduplication results back to TFDS datasets, and tokenizing with the GPT-2 or T5 tokenizers. It is a prerequisite for running the `Load_Dataset_TFDS`, `Finish_Dedup_Wiki40b`, and `Make_Suffix_Array` implementations.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | macOS may work but is not tested |
| Hardware | CPU (no GPU required) | Dataset loading and serialization are CPU-bound |
| RAM | Proportional to dataset size | Wiki40B test set requires minimal RAM; full training sets require more |
| Disk | 2x dataset size minimum | Binary serialization output + suffix array table files |
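
The disk guideline above can be checked before a long run starts. The following is a minimal sketch using only the standard library; the function name `has_enough_disk` and the `factor` parameter are illustrative and not part of the repository.

```python
import shutil


def has_enough_disk(output_dir: str, dataset_bytes: int, factor: float = 2.0) -> bool:
    """Return True if output_dir's filesystem has at least factor * dataset_bytes free."""
    free_bytes = shutil.disk_usage(output_dir).free
    return free_bytes >= factor * dataset_bytes


if __name__ == "__main__":
    # Example: a hypothetical 1 GiB dataset written under the current directory.
    print(has_enough_disk(".", dataset_bytes=1 << 30))
```

The factor of 2 mirrors the stated minimum; actual suffix array table sizes may call for a larger margin.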

Dependencies

System Packages

  • Python 3 runtime

Python Packages (requirements-tf.txt)

  • `tensorflow` == 2.9.0
  • `tensorflow-datasets` == 4.9.3
  • `tensorflow-estimator` == 2.9.0
  • `tensorflow-io-gcs-filesystem` == 0.34.0
  • `tensorflow-metadata` == 1.14.0
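
Because these versions are pinned, it can help to confirm that the active environment actually matches them. The sketch below compares installed distributions against the pins listed above using `importlib.metadata`; the helper name `check_pins` is hypothetical.

```python
from importlib import metadata

# Pins copied from requirements-tf.txt as listed on this page.
PINNED = {
    "tensorflow": "2.9.0",
    "tensorflow-datasets": "4.9.3",
    "tensorflow-estimator": "2.9.0",
    "tensorflow-io-gcs-filesystem": "0.34.0",
    "tensorflow-metadata": "1.14.0",
}


def check_pins(pins=PINNED):
    """Return {package: (installed_or_None, expected)} for every package that differs."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # distribution is not installed at all
        if installed != expected:
            mismatches[name] = (installed, expected)
    return mismatches


if __name__ == "__main__":
    for name, (installed, expected) in check_pins().items():
        print(f"{name}: installed={installed}, expected={expected}")
```

An empty result means every pinned package is present at the expected version.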

Additional Python Packages

  • `numpy`
  • `scipy`
  • `sentencepiece`
  • `transformers` (for GPT2Tokenizer, T5Tokenizer)

Credentials

No credentials are required for local TFDS dataset loading. When the TFDS data directory is on Google Cloud Storage (GCS), set:

  • `GOOGLE_APPLICATION_CREDENTIALS`: Path to GCP service account JSON (for accessing remote TFDS data directories).
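
The rule above (credentials only matter for remote data directories) could be checked up front along these lines. This is a sketch under assumptions: the `gs://` prefix test and the helper name `gcs_credentials_problem` are illustrative, and other Google Cloud authentication routes may also work in practice.

```python
import os


def gcs_credentials_problem(data_dir, environ=None):
    """Return a message if data_dir is on GCS and the credentials variable looks unusable, else None."""
    environ = os.environ if environ is None else environ
    if not data_dir.startswith("gs://"):
        return None  # local directory: no credentials required
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {path}"
    return None


if __name__ == "__main__":
    # Hypothetical bucket name, used only for illustration.
    print(gcs_credentials_problem("gs://example-bucket/tfds"))
```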

Quick Install

# Install Python dependencies
pip3 install numpy scipy sentencepiece
pip3 install -r requirements-tf.txt
pip3 install transformers

Code Evidence

TensorFlow and TFDS imports from `scripts/load_dataset.py:14-15`:

import tensorflow_datasets as tfds
import tensorflow as tf

Transformers tokenizer usage from `scripts/load_dataset.py:19`:

from transformers import GPT2Tokenizer, T5Tokenizer

Pinned dependency versions from `requirements-tf.txt:1-5`:

tensorflow==2.9.0
tensorflow-datasets==4.9.3
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.34.0
tensorflow-metadata==1.14.0

TFDS batch loading with batch_size=2**16 from `scripts/load_dataset.py:49`:

ds = tfds.load(dataset_name, split=split, shuffle_files=False, batch_size=2**16,
               data_dir=data_dir)
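
With `batch_size=2**16`, each element of `ds` holds up to 65,536 examples rather than one. The sketch below illustrates how such batches might be flattened into a single binary stream. It is not taken from `load_dataset.py`: the `"text"` feature key, the separator layout, and the function name `write_flat_binary` are all assumptions made for illustration, and the batches are in-memory stand-ins so the example runs without TensorFlow.

```python
import io
import struct

# Hypothetical document separator; the real script's on-disk format may differ.
SEPARATOR_PREFIX = b"\xff\xff"


def write_flat_binary(batches, out):
    """Append every document in `batches` to the binary stream `out`.

    `batches` is any iterable of mappings with a "text" entry holding byte
    strings, e.g. what iterating tfds.as_numpy(ds) might yield for a batched
    dataset (an assumption, not taken from load_dataset.py).
    Returns the byte offset at which each document's separator starts.
    """
    offsets = []
    doc_id = 0
    for batch in batches:
        for text in batch["text"]:
            offsets.append(out.tell())
            # Separator = prefix + little-endian 32-bit document counter.
            out.write(SEPARATOR_PREFIX + struct.pack("<I", doc_id))
            out.write(bytes(text))
            doc_id += 1
    return offsets


# Usage with in-memory stand-ins for two TFDS batches.
buf = io.BytesIO()
starts = write_flat_binary([{"text": [b"ab", b"cde"]}, {"text": [b"f"]}], buf)
```

Processing whole batches keeps per-example overhead low, which matters because loading and serialization are CPU-bound.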

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'tensorflow'` | TensorFlow not installed | `pip3 install -r requirements-tf.txt` |
| `ModuleNotFoundError: No module named 'tensorflow_datasets'` | TFDS not installed | `pip3 install tensorflow-datasets==4.9.3` |
| `ModuleNotFoundError: No module named 'transformers'` | Transformers not installed | `pip3 install transformers` |
| `DatasetNotFoundError` | TFDS dataset not downloaded | Set `--data_dir` to a directory where TFDS can download the dataset |
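
The three `ModuleNotFoundError` cases can be detected before any script runs. The following sketch looks up each module without importing it and pairs missing ones with the install commands from the table above; the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Install commands taken from the table above.
INSTALL_HINTS = {
    "tensorflow": "pip3 install -r requirements-tf.txt",
    "tensorflow_datasets": "pip3 install tensorflow-datasets==4.9.3",
    "transformers": "pip3 install transformers",
}


def missing_modules(hints=INSTALL_HINTS):
    """Return {module_name: install_command} for every module that cannot be found."""
    return {
        name: command
        for name, command in hints.items()
        if importlib.util.find_spec(name) is None
    }


if __name__ == "__main__":
    for name, command in missing_modules().items():
        print(f"{name} is missing; try: {command}")
```

Using `find_spec` avoids the cost of actually importing TensorFlow just to test for its presence.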

Compatibility Notes

  • TensorFlow version: Pinned to 2.9.0. Later versions may work but are untested.
  • Multiprocessing: The code uses `mp.get_context("fork").Pool`, which requires a fork-capable OS (Linux/macOS). Windows is not supported.
  • Tokenization: The optional `--tokenize` flag requires the `transformers` library. Supported tokenizers: `gpt2`, `t5`.
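
The multiprocessing note can be made concrete with a small sketch that uses the same `mp.get_context("fork").Pool` call and fails early where the `fork` start method is unavailable. The wrapper name `map_with_fork` is illustrative and not part of the repository.

```python
import multiprocessing as mp


def map_with_fork(func, items, processes=2):
    """Run func over items in a fork-based pool, failing early where fork is unavailable."""
    if "fork" not in mp.get_all_start_methods():
        raise RuntimeError("the 'fork' start method is not available on this OS")
    # Same pool construction as noted above; the context manager cleans up workers.
    with mp.get_context("fork").Pool(processes) as pool:
        return pool.map(func, items)


if __name__ == "__main__":
    # len stands in for real per-chunk work.
    print(map_with_fork(len, [b"ab", b"cde", b""]))
```

Checking `mp.get_all_start_methods()` first turns an obscure failure on Windows into a clear error message.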
