Environment:ChenghaoMou Text dedup Python 3 12 Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, NLP, Text_Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A CPU-only Python 3.12+ environment for text deduplication, built on Polars, HuggingFace Datasets, NumPy, SciPy, and specialized hashing and fingerprinting packages.

Description

This environment provides the standard runtime for the text-dedup library. It is a CPU-based Python environment (no GPU required) built around data processing and hashing libraries. The stack uses Polars for high-performance DataFrame operations, HuggingFace Datasets for dataset I/O, and specialized packages for Bloom filtering (rbloom), bit manipulation (bitarray), and fast hashing (xxhash). Pydantic is used for configuration management with CLI support via pydantic-settings.

Usage

Use this environment for all text deduplication workflows: MinHash LSH, SimHash, Bloom Filter, and Suffix Array deduplication. It is also required for running benchmarks and the Gradio report application. This is the base prerequisite for every Implementation in the repository.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS, or Windows | Cross-platform; CI runs on Ubuntu |
| Hardware | CPU with multiple cores | `num_proc` defaults to `max(1, os.cpu_count() or 1)`; no GPU required |
| Memory | Depends on dataset size | `check_env()` logs available memory at startup |
| Disk | Sufficient for dataset caching | HuggingFace Datasets uses Arrow cache files |
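
The requirements above can be checked before a run. The following is a minimal sketch using only the standard library; `summarize_resources` is a hypothetical helper, not part of text-dedup, and the directory passed in is assumed to be wherever the dataset cache lives on your machine.

```python
import os
import shutil


def summarize_resources(cache_dir: str = ".") -> dict[str, float]:
    """Return the logical CPU count and free disk space (GiB) for cache_dir."""
    usage = shutil.disk_usage(cache_dir)  # named tuple: total, used, free (bytes)
    return {
        # os.cpu_count() may return None on some platforms, so fall back to 1.
        "cpu_count": os.cpu_count() or 1,
        "disk_free_gib": usage.free / 1024**3,
    }


if __name__ == "__main__":
    for key, value in summarize_resources().items():
        print(f"{key}: {value}")
```

Available memory is not covered here because the standard library has no portable call for it; `check_env()` relies on `psutil` for that figure.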

Dependencies

System Packages

  • Python >= 3.12, < 4.0
  • `uv` >= 0.6.14 (recommended package manager, used in CI)

Python Packages

  • `bitarray` >= 3.7.1
  • `datasets` >= 4.0.0
  • `ftfy` >= 6.3.1
  • `numpy` >= 1.26.0
  • `polars` >= 1.32.3
  • `polars-grouper` >= 0.3.0
  • `psutil` >= 5.9.0
  • `pydantic` >= 2.0.0
  • `pydantic-settings` >= 2.9.1
  • `rbloom` >= 1.5.2
  • `regex` >= 2024.11.6
  • `rich` >= 13.0.0
  • `scipy` >= 1.13.1
  • `tqdm` >= 4.66.0
  • `xxhash` >= 3.5.0
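
To confirm that an existing environment satisfies these minimums, the installed versions can be compared against the list. This is a rough sketch, not project tooling: `MINIMUMS` copies only a subset of the list above, and `as_tuple` and `find_problems` are hypothetical helpers that compare only the leading numeric components of a version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the minimum versions listed above; extend as needed.
MINIMUMS = {
    "polars": "1.32.3",
    "datasets": "4.0.0",
    "rbloom": "1.5.2",
    "xxhash": "3.5.0",
}


def as_tuple(v: str) -> tuple[int, ...]:
    """Parse leading numeric components, e.g. '1.32.3' becomes (1, 32, 3)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component (e.g. 'dev0')
        parts.append(int(digits))
    return tuple(parts)


def find_problems(minimums: dict[str, str]) -> dict[str, str]:
    """Map each missing or outdated distribution to a short explanation."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    print(find_problems(MINIMUMS) or "all checked packages satisfy the minimums")
```

The tuple comparison ignores pre-release and local version suffixes; `packaging.version.Version` handles those cases properly if the `packaging` library is available.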

Optional Packages

  • `scalene` >= 1.5.51 (profiling, conditionally imported when `enable_profiling=True`)
  • `gradio` >= 4.44.1 (report UI)
  • `gradio-rangeslider` >= 0.0.8 (report UI)
  • `plotly` >= 6.1.1 (report UI)
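
Because these packages are optional, it can help to detect them before enabling the matching feature. The sketch below is an illustration only: `OPTIONAL_MODULES`, `is_available`, and `missing_for` are hypothetical names, and it assumes the import names `scalene`, `gradio`, and `plotly` match the package names.

```python
from importlib.util import find_spec

# Feature name mapped to the top-level modules it needs (assumed import names).
OPTIONAL_MODULES = {
    "profiling": ["scalene"],
    "report UI": ["gradio", "plotly"],
}


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def missing_for(feature: str) -> list[str]:
    """List the modules a feature needs that are not installed."""
    return [name for name in OPTIONAL_MODULES[feature] if not is_available(name)]


if __name__ == "__main__":
    for feature in OPTIONAL_MODULES:
        missing = missing_for(feature)
        status = "ready" if not missing else f"missing {', '.join(missing)}"
        print(f"{feature}: {status}")
```

Checking with `find_spec` avoids the import cost of heavy packages when the feature ends up unused.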

Credentials

No credentials or environment variables are required. The library operates on local datasets and does not call external APIs.

Quick Install

# Install with pip
pip install text-dedup

# Or install from source with uv (recommended)
git clone https://github.com/ChenghaoMou/text-dedup.git
cd text-dedup
uv sync --frozen

Code Evidence

Python version constraint from `pyproject.toml:6`:

requires-python = ">=3.12,<4.0"

Environment check function from `src/text_dedup/utils/env.py:4-10`:

def check_env() -> None:  # pragma: no cover
    import polars as pl
    import psutil

    log.info(f"Polars thread pool size: {pl.thread_pool_size()}")
    log.info(f"CPU count: {psutil.cpu_count()}")
    log.info(f"Available memory: {psutil.virtual_memory().available / 1024 / 1024 / 1024:.2f} GB")

CPU count auto-detection from `src/text_dedup/config/algorithms/base.py:19`:

num_proc: int = max(1, os.cpu_count() or 1)

All main entry points call `check_env()` at startup, e.g., `src/text_dedup/minhash.py:224-230`:

if __name__ == "__main__":
    from pydantic_settings import CliApp
    from text_dedup.utils.env import check_env

    config = CliApp.run(Config)
    check_env()

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'polars_grouper'` | Missing dependency | `pip install "polars-grouper>=0.3.0"` (quote the specifier so the shell does not treat `>` as a redirect) |
| `ModuleNotFoundError: No module named 'scalene'` | Profiling enabled but scalene not installed | `pip install scalene` or set `enable_profiling=False` |
| `Python version <3.12 not supported` | Using older Python | Upgrade to Python 3.12 or newer (below 4.0) |
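
The version error can be caught early with an explicit check against the `requires-python` range. This is a hedged sketch rather than code from the repository; `python_is_supported` and the two bound constants are hypothetical names.

```python
import sys

MIN_PYTHON = (3, 12)  # inclusive lower bound from requires-python
MAX_PYTHON = (4, 0)   # exclusive upper bound from requires-python


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version falls within >=3.12,<4.0."""
    major_minor = tuple(version_info[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON


if __name__ == "__main__":
    if not python_is_supported():
        current = ".".join(map(str, sys.version_info[:3]))
        raise SystemExit(f"Python {current} is outside the supported range >=3.12,<4.0")
    print("Python version OK")
```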

Compatibility Notes

  • Python 3.12+: This library requires Python 3.12 or newer (and below 4.0). Its Ruff configuration targets `py312`, and the code uses modern Python features such as `match`/`case` statements and `type` unions.
  • No GPU required: All algorithms are CPU-based. The library uses NumPy, Polars, and pure Python for computation.
  • CI Environment: GitHub Actions CI installs the locked dependencies with `uv sync --frozen`, using Python 3.12 by default.
