Environment:ChenghaoMou Text dedup Python 3 12 Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, NLP, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
CPU-only Python 3.12+ environment for text deduplication, built on Polars, HuggingFace Datasets, NumPy, SciPy, and specialized hashing and fingerprinting packages.
Description
This environment provides the standard runtime for the text-dedup library. It is a CPU-based Python environment (no GPU required) built around data processing and hashing libraries. The stack uses Polars for high-performance DataFrame operations, HuggingFace Datasets for dataset I/O, and specialized packages for Bloom filtering (rbloom), bit manipulation (bitarray), and fast hashing (xxhash). Pydantic is used for configuration management with CLI support via pydantic-settings.
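To make the role of the Bloom filtering and hashing packages concrete, here is a minimal sketch of Bloom-filter-based exact deduplication. It deliberately uses only the standard library (`hashlib` and a `bytearray`) as a stand-in for `rbloom` and `xxhash`. The class `TinyBloom` and the function `dedup_exact` are hypothetical names invented for this illustration and are not part of text-dedup or of either package.

```python
import hashlib


class TinyBloom:
    """Illustrative Bloom filter. This is NOT the rbloom API."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def dedup_exact(texts):
    """Keep the first occurrence of each text, dropping probable repeats."""
    seen = TinyBloom()
    kept = []
    for text in texts:
        if text in seen:  # may rarely be a false positive
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

A Bloom filter never reports a false negative, so every true duplicate is dropped. The cost is a small, tunable chance of discarding a unique document as a false positive.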
Usage
Use this environment for all text deduplication workflows: MinHash LSH, SimHash, Bloom Filter, and Suffix Array deduplication. It is also required for running benchmarks and the Gradio report application. This is the base prerequisite for every Implementation in the repository.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows | Cross-platform; CI runs on Ubuntu |
| Hardware | Multi-core CPU | `num_proc` defaults to `os.cpu_count()`, falling back to 1 if the count cannot be determined; no GPU required |
| Memory | Depends on dataset size | `check_env()` logs available memory at startup |
| Disk | Sufficient for dataset caching | HuggingFace Datasets uses Arrow cache files |
Dependencies
System Packages
- Python >= 3.12, < 4.0
- `uv` >= 0.6.14 (recommended package manager, used in CI)
Python Packages
- `bitarray` >= 3.7.1
- `datasets` >= 4.0.0
- `ftfy` >= 6.3.1
- `numpy` >= 1.26.0
- `polars` >= 1.32.3
- `polars-grouper` >= 0.3.0
- `psutil` >= 5.9.0
- `pydantic` >= 2.0.0
- `pydantic-settings` >= 2.9.1
- `rbloom` >= 1.5.2
- `regex` >= 2024.11.6
- `rich` >= 13.0.0
- `scipy` >= 1.13.1
- `tqdm` >= 4.66.0
- `xxhash` >= 3.5.0
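One way to confirm that an existing environment meets these minimums is to compare installed versions against them with `importlib.metadata`. The sketch below is an illustration only: the helper names (`parse_version`, `is_older`, `check_minimums`) are hypothetical, the version parser is deliberately naive (it ignores pre-release and local tags), and only a subset of the list above is included.

```python
from importlib import metadata

# Minimum versions copied from the dependency list above (subset shown).
MINIMUMS = {
    "bitarray": "3.7.1",
    "datasets": "4.0.0",
    "numpy": "1.26.0",
    "polars": "1.32.3",
    "xxhash": "3.5.0",
}


def parse_version(version):
    """Naive parser: keep only the leading numeric components."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_older(installed, minimum):
    """True if `installed` sorts strictly before `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "1.26" and "1.26.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_minimums(minimums):
    """Return {package: problem} for anything missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if is_older(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    for name, problem in check_minimums(MINIMUMS).items():
        print(f"{name}: {problem}")
```

For production use, a real version parser such as the one in the `packaging` library is a safer choice than the naive tuple comparison shown here.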
Optional Packages
- `scalene` >= 1.5.51 (profiling, conditionally imported when `enable_profiling=True`)
- `gradio` >= 4.44.1 (report UI)
- `gradio-rangeslider` >= 0.0.8 (report UI)
- `plotly` >= 6.1.1 (report UI)
Credentials
No credentials or environment variables are required. The library operates on local datasets and does not call external APIs.
Quick Install
```bash
# Install with pip
pip install text-dedup

# Or install from source with uv (recommended)
git clone https://github.com/ChenghaoMou/text-dedup.git
cd text-dedup
uv sync --frozen
```
Code Evidence
Python version constraint from `pyproject.toml:6`:
```toml
requires-python = ">=3.12,<4.0"
```
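The same constraint can be checked at runtime before importing the library. This is a minimal sketch, not something text-dedup ships: `is_supported` and the two bound constants are hypothetical names, and the code avoids newer syntax so that it still runs on the older interpreters it is meant to detect.

```python
import sys

MIN_VERSION = (3, 12)  # inclusive lower bound from requires-python
MAX_VERSION = (4, 0)   # exclusive upper bound from requires-python


def is_supported(version):
    """Return True if (major, minor, ...) satisfies >=3.12,<4.0."""
    return MIN_VERSION <= tuple(version[:2]) < MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported(sys.version_info) else "NOT supported"
    print("Python {} is {}".format(current, status))
```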
Environment check function from `src/text_dedup/utils/env.py:4-10`:
```python
def check_env() -> None:  # pragma: no cover
    import polars as pl
    import psutil

    log.info(f"Polars thread pool size: {pl.thread_pool_size()}")
    log.info(f"CPU count: {psutil.cpu_count()}")
    log.info(f"Available memory: {psutil.virtual_memory().available / 1024 / 1024 / 1024:.2f} GB")
```
CPU count auto-detection from `src/text_dedup/config/algorithms/base.py:19`:
```python
num_proc: int = max(1, os.cpu_count() or 1)
```
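The `or 1` guard matters because `os.cpu_count()` returns `None` when the number of CPUs cannot be determined, and `max(1, None)` would raise a `TypeError`. The sketch below isolates that expression in a hypothetical helper, `resolve_num_proc`, so the fallback behavior can be seen for each input. The helper is for illustration and does not exist in text-dedup.

```python
import os


def resolve_num_proc(detected):
    """Apply the same expression as the default: max(1, detected or 1).

    `detected` is whatever os.cpu_count() returned: an int or None.
    """
    return max(1, detected or 1)


if __name__ == "__main__":
    print("None ->", resolve_num_proc(None))  # undetermined CPU count
    print("8    ->", resolve_num_proc(8))     # typical multi-core machine
    print("here ->", resolve_num_proc(os.cpu_count()))
```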
All main entry points call `check_env()` at startup, e.g., `src/text_dedup/minhash.py:224-230`:
```python
if __name__ == "__main__":
    from pydantic_settings import CliApp

    from text_dedup.utils.env import check_env

    config = CliApp.run(Config)
    check_env()
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'polars_grouper'` | Missing dependency | `pip install "polars-grouper>=0.3.0"` (quote the specifier so the shell does not treat `>` as a redirect) |
| `ModuleNotFoundError: No module named 'scalene'` | Profiling enabled but scalene not installed | `pip install scalene` or set `enable_profiling=False` |
| `Python version <3.12 not supported` | Using older Python | Upgrade to Python 3.12 or newer (below 4.0) |
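The scalene error above arises because the profiler is imported only when profiling is switched on. A common way to guard such an optional import is to probe for the module first and fail with an actionable message. The following is a sketch of that pattern under stated assumptions, not the library's actual code: `load_optional` and `maybe_load_profiler` are hypothetical helpers.

```python
import importlib
import importlib.util


def load_optional(module_name, install_hint):
    """Import a top-level optional module or raise with a helpful hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ModuleNotFoundError(
            "No module named {!r}. {}".format(module_name, install_hint)
        )
    return importlib.import_module(module_name)


def maybe_load_profiler(enable_profiling):
    """Return the scalene module only when profiling is requested."""
    if not enable_profiling:
        return None  # scalene is never touched, so it need not be installed
    return load_optional(
        "scalene",
        "Install it with `pip install scalene` or set enable_profiling=False.",
    )
```

With this pattern, environments that never enable profiling do not need the optional package installed at all.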
Compatibility Notes
- Python 3.12+: This library requires Python 3.12 or newer. It targets `py312` in Ruff configuration and uses modern Python features like `match/case` statements and `type` unions.
- No GPU required: All algorithms are CPU-based. The library uses NumPy, Polars, and pure Python for computation.
- CI Environment: GitHub Actions CI runs `uv sync --frozen`, which installs dependencies from the existing lockfile without updating it, with Python 3.12 as the default.
Related Pages
- Implementation:ChenghaoMou_Text_dedup_Config_Loading
- Implementation:ChenghaoMou_Text_dedup_Load_Dataset
- Implementation:ChenghaoMou_Text_dedup_Save_Dataset
- Implementation:ChenghaoMou_Text_dedup_MinHash_Get_Embed_Func
- Implementation:ChenghaoMou_Text_dedup_MinHash_LSH_Cluster
- Implementation:ChenghaoMou_Text_dedup_MinHash_Check_False_Positives
- Implementation:ChenghaoMou_Text_dedup_SimHash_Get_Embed_Func
- Implementation:ChenghaoMou_Text_dedup_SimHash_Union_Find_Cluster
- Implementation:ChenghaoMou_Text_dedup_SimHash_Check_False_Positives
- Implementation:ChenghaoMou_Text_dedup_Bloom_Filter_Func
- Implementation:ChenghaoMou_Text_dedup_SA_Run_Command
- Implementation:ChenghaoMou_Text_dedup_SA_Restore_And_Merge
- Implementation:ChenghaoMou_Text_dedup_Jaccard_Similarity_Func
- Implementation:ChenghaoMou_Text_dedup_UnionFind_Class
- Implementation:ChenghaoMou_Text_dedup_Evaluate_Predictions