Environment:Huggingface Datasets Python PyArrow Core

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Data_Processing
Last Updated 2026-02-14 19:00 GMT

Overview

This is the core runtime environment for the HuggingFace Datasets library, defining the minimum Python version, PyArrow backend, and all required system-level dependencies needed to load, process, and cache datasets.

Description

The HuggingFace Datasets library requires Python >= 3.10.0 and uses Apache Arrow (via PyArrow >= 21.0.0) as its columnar in-memory backend for high-performance data serialization and processing. The core environment encompasses all mandatory dependencies declared in setup.py under REQUIRED_PKGS, along with environment variables defined in src/datasets/config.py that control caching behavior, offline mode, in-memory limits, progress bar display, and ML framework selection.

Key architectural decisions reflected in this environment include:

  • PyArrow >= 21.0.0 is required to support use_content_defined_chunking in ParquetWriter, which enables deterministic shard boundaries during dataset preparation.
  • dill >= 0.3.0, < 0.4.1 is temporarily pinned because dill does not yet have official support for deterministic serialization (see dill#19).
  • multiprocess < 0.70.19 is pinned to align with the dill version constraint, as multiprocess bundles dill internally.
  • fsspec[http] >= 2023.1.0 is the minimum version that supports protocol=kwargs in fsspec's open, get_fs_token_paths, and related methods (see fsspec#1143).

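The dill pin matters because datasets fingerprints user-provided processing functions by serializing and hashing them to decide whether a cached result can be reused; any nondeterminism in serialization would silently invalidate caches. A minimal stdlib-only sketch of the idea (the real library uses dill plus xxhash, not hashlib; this is an illustrative stand-in):

```python
import hashlib

def fingerprint(func):
    """Hash a function's bytecode and constants as a cheap cache key.

    Stand-in for datasets' real fingerprinting, which serializes the
    whole function with dill and hashes the bytes with xxhash; here we
    hash only the code object's bytecode and constants for illustration.
    """
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

def add_one(x):
    return x + 1

def add_two(x):
    return x + 2

def add_one_again(x):
    return x + 1

# Identical function bodies yield identical fingerprints, so a cached
# result can be reused; a changed body yields a different key.
print(fingerprint(add_one) == fingerprint(add_one_again))  # True
print(fingerprint(add_one) == fingerprint(add_two))        # False
```

This is why determinism in the serializer is load-bearing: if the same function serialized to different bytes across runs, every run would recompute from scratch.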
Usage

Reference this environment page whenever you need to understand or reproduce the baseline runtime for any HuggingFace Datasets operation. All implementation pages in this wiki depend on this core environment. If you are working with optional extras (audio, vision, TensorFlow, PyTorch, JAX), consult the relevant optional environment pages in addition to this one.

System Requirements

| Component | Requirement | Notes |
|---|---|---|
| Python | >= 3.10.0 | Supports 3.10, 3.11, 3.12, 3.13, 3.14 per setup.py classifiers |
| Operating System | Linux, macOS, Windows | "Operating System :: OS Independent" classifier in setup.py |
| Disk Space | Varies | Datasets are cached to ~/.cache/huggingface/datasets by default; large datasets may require significant disk space |
| Memory | Varies | In-memory mode is disabled by default (HF_DATASETS_IN_MEMORY_MAX_SIZE=0); Arrow memory-mapped files are used instead |

Dependencies

System Packages

No system-level packages beyond Python itself are required for the core environment. All dependencies are pure Python or ship pre-built wheels:

  • Python >= 3.10.0 (CPython recommended)
  • A working C compiler is only needed if building PyArrow from source (rare; wheels are available for all supported platforms)

Python Packages

The following are the core dependencies declared in REQUIRED_PKGS in setup.py:

| Package | Version Constraint | Purpose |
|---|---|---|
| pyarrow | >= 21.0.0 | Backend columnar storage and serialization; minimum for use_content_defined_chunking in ParquetWriter |
| dill | >= 0.3.0, < 0.4.1 | Smart caching of dataset processing functions; pinned pending determinism support |
| multiprocess | < 0.70.19 | Multiprocessing with dill-based pickling; version aligned with the dill pin |
| fsspec[http] | >= 2023.1.0, <= 2025.10.0 | Filesystem abstraction for local, remote, and cloud storage; minimum for protocol=kwargs support |
| huggingface-hub | >= 0.25.0, < 2.0 | Interaction with the HuggingFace Hub (downloading, uploading, authentication) |
| pandas | (no version constraint) | Performance gains with Apache Arrow interchange |
| numpy | >= 1.17 | Required for np.random.Generator used in Dataset shuffling |
| requests | >= 2.32.2 | HTTP downloads of datasets |
| httpx | < 1.0.0 | Alternative HTTP client |
| tqdm | >= 4.66.3 | Progress bars for downloads and data operations |
| xxhash | (no version constraint) | Fast hashing for caching |
| filelock | (no version constraint) | File locking for safe concurrent access |
| packaging | (no version constraint) | Version parsing and comparison utilities (PyPA) |
| pyyaml | >= 5.1 | Parsing YAML metadata from dataset cards |
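To audit which of these core dependencies are present in an environment, the standard library's importlib.metadata can report installed distribution versions without importing the packages. A small sketch (package names are taken from the table above as they appear on PyPI; name normalization behavior can vary slightly across Python versions):

```python
from importlib.metadata import version, PackageNotFoundError

CORE_PKGS = [
    "pyarrow", "dill", "multiprocess", "fsspec", "huggingface-hub",
    "pandas", "numpy", "requests", "httpx", "tqdm", "xxhash",
    "filelock", "packaging", "pyyaml",
]

def installed_version(name):
    """Return the installed distribution version, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

# Print a simple report; compare against the constraint table above.
for name in CORE_PKGS:
    v = installed_version(name)
    print(f"{name:20s} {v or 'MISSING'}")
```

This only checks presence, not constraint satisfaction; for real constraint checking, packaging's SpecifierSet (already a core dependency) is the usual tool.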

Credentials

No API keys or tokens are strictly required for the core environment. However, the following environment variables (defined in src/datasets/config.py) control runtime behavior:

| Variable | Default | Description |
|---|---|---|
| HF_ENDPOINT | https://huggingface.co | Base URL for the HuggingFace Hub API |
| HF_DATASETS_CACHE | ~/.cache/huggingface/datasets | Root directory for cached datasets |
| HF_HOME | ~/.cache/huggingface | Root directory for all HuggingFace caches (hub, datasets, modules) |
| XDG_CACHE_HOME | ~/.cache | XDG base directory for caches; HF_HOME defaults to $XDG_CACHE_HOME/huggingface |
| HF_DATASETS_OFFLINE | (unset) | Set to 1 to enable offline mode; falls back to HF_HUB_OFFLINE if unset |
| HF_DATASETS_IN_MEMORY_MAX_SIZE | 0 (disabled) | Maximum dataset size in bytes to keep entirely in memory; 0 means always use memory-mapped Arrow files |
| HF_DATASETS_DISABLE_PROGRESS_BARS | (unset) | Set to 1 to globally disable progress bars; set to 0 to force enable; unset allows programmatic control |
| HF_UPDATE_DOWNLOAD_COUNTS | AUTO | Whether to update download counts on the Hub when fetching a dataset |
| USE_TF | AUTO | Framework selection: set to 1 to force TensorFlow, 0 to disable |
| USE_TORCH | AUTO | Framework selection: set to 1 to force PyTorch, 0 to disable |
| USE_JAX | AUTO | Framework selection: set to 1 to force JAX, 0 to disable |
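The three cache-related variables compose in a fixed order: XDG_CACHE_HOME sets the base, HF_HOME overrides the HuggingFace root, and HF_DATASETS_CACHE overrides the final datasets directory. A small stdlib sketch of that resolution order, mirroring the config.py logic quoted later on this page (the function name is illustrative):

```python
import os

def resolve_datasets_cache(env):
    """Resolve the datasets cache directory from an environment mapping,
    applying overrides in the same order as src/datasets/config.py:
    XDG_CACHE_HOME -> HF_HOME -> HF_DATASETS_CACHE."""
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = os.path.expanduser(
        env.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    )
    return env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets"))

# No overrides: the default ~/.cache/huggingface/datasets (expanded).
print(resolve_datasets_cache({}))

# HF_HOME moves every HuggingFace cache, including datasets:
print(resolve_datasets_cache({"HF_HOME": "/data/hf"}))  # /data/hf/datasets

# HF_DATASETS_CACHE overrides the datasets directory alone:
print(resolve_datasets_cache({"HF_DATASETS_CACHE": "/scratch/ds"}))  # /scratch/ds
```

Passing a plain dict rather than reading os.environ directly makes the resolution order easy to test in isolation.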

Quick Install

# Install the core datasets library with all required dependencies
pip install datasets

# Or install a specific version
pip install datasets==4.5.0

# Verify the installation
python -c "import datasets; print(datasets.__version__)"

To install from source (main branch):

pip install "datasets @ git+https://github.com/huggingface/datasets@main#egg=datasets"

Code Evidence

From setup.py

python_requires=">=3.10.0"

REQUIRED_PKGS = [
    # For file locking
    "filelock",
    # We use numpy>=1.17 to have np.random.Generator (Dataset shuffling)
    "numpy>=1.17",
    # Backend and serialization.
    # Minimum 21.0.0 to support `use_content_defined_chunking` in ParquetWriter
    "pyarrow>=21.0.0",
    # For smart caching dataset processing
    "dill>=0.3.0,<0.4.1",  # tmp pin until dill has official support for determinism
    # For performance gains with apache arrow
    "pandas",
    # for downloading datasets over HTTPS
    "requests>=2.32.2",
    "httpx<1.0.0",
    # progress bars in downloads and data operations
    "tqdm>=4.66.3",
    # for fast hashing
    "xxhash",
    # for better multiprocessing
    "multiprocess<0.70.19",  # to align with dill<0.3.9 (see above)
    # to save datasets locally or on any filesystem
    "fsspec[http]>=2023.1.0,<=2025.10.0",
    # To get datasets from the Datasets Hub on huggingface.co
    "huggingface-hub>=0.25.0,<2.0",
    # Utilities from PyPA to e.g., compare versions
    "packaging",
    # To parse YAML metadata from dataset cards
    "pyyaml>=5.1",
]

From src/datasets/config.py

# Hub
HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co")

# Cache location
DEFAULT_XDG_CACHE_HOME = "~/.cache"
XDG_CACHE_HOME = os.getenv("XDG_CACHE_HOME", DEFAULT_XDG_CACHE_HOME)
DEFAULT_HF_CACHE_HOME = os.path.join(XDG_CACHE_HOME, "huggingface")
HF_CACHE_HOME = os.path.expanduser(os.getenv("HF_HOME", DEFAULT_HF_CACHE_HOME))

DEFAULT_HF_DATASETS_CACHE = os.path.join(HF_CACHE_HOME, "datasets")
HF_DATASETS_CACHE = Path(os.getenv("HF_DATASETS_CACHE", DEFAULT_HF_DATASETS_CACHE))

# Offline mode
_offline = os.environ.get("HF_DATASETS_OFFLINE")
HF_HUB_OFFLINE = constants.HF_HUB_OFFLINE if _offline is None else _offline.upper() in ENV_VARS_TRUE_VALUES

# In-memory
DEFAULT_IN_MEMORY_MAX_SIZE = 0  # Disabled
IN_MEMORY_MAX_SIZE = float(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", DEFAULT_IN_MEMORY_MAX_SIZE))

# Framework selection
USE_TF = os.environ.get("USE_TF", "AUTO").upper()
USE_TORCH = os.environ.get("USE_TORCH", "AUTO").upper()
USE_JAX = os.environ.get("USE_JAX", "AUTO").upper()

Common Errors

| Error | Cause | Resolution |
|---|---|---|
| ImportError: PyArrow >= 21.0.0 must be installed | PyArrow version too old or not installed | Run pip install "pyarrow>=21.0.0" |
| ConnectionError when loading a dataset | No internet access and offline mode not configured | Set HF_DATASETS_OFFLINE=1 and ensure the dataset is already cached locally |
| MemoryError during .map() | Dataset too large for available RAM with in-memory mode enabled | Set HF_DATASETS_IN_MEMORY_MAX_SIZE=0 to use memory-mapped Arrow files instead |
| dill serialization errors after upgrading | dill version outside the pinned range (>= 0.4.1) | Pin dill to >=0.3.0,<0.4.1 as required by the library |
| fsspec TypeError on protocol kwarg | fsspec version below 2023.1.0 lacks protocol=kwargs support | Upgrade with pip install "fsspec[http]>=2023.1.0" |
| FileNotFoundError for cache directory | Custom HF_DATASETS_CACHE path does not exist | Create the directory or correct the environment variable path |
| multiprocess version conflict | multiprocess and dill versions misaligned | Ensure multiprocess<0.70.19 and dill>=0.3.0,<0.4.1 are both satisfied |
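The offline-mode fallback behavior (HF_DATASETS_OFFLINE wins if set, otherwise HF_HUB_OFFLINE applies) can be sketched in plain Python. The set of accepted truthy strings below mirrors the convention used by huggingface_hub's ENV_VARS_TRUE_VALUES; the exact membership here is an illustrative assumption:

```python
# Truthy strings accepted for offline-mode variables (assumed set,
# modeled on huggingface_hub's ENV_VARS_TRUE_VALUES convention).
ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}

def is_offline(env):
    """Replicate the config.py precedence: HF_DATASETS_OFFLINE wins when
    set; otherwise fall back to HF_HUB_OFFLINE."""
    value = env.get("HF_DATASETS_OFFLINE")
    if value is None:
        return env.get("HF_HUB_OFFLINE", "").upper() in ENV_VARS_TRUE_VALUES
    return value.upper() in ENV_VARS_TRUE_VALUES

print(is_offline({"HF_DATASETS_OFFLINE": "1"}))   # True
print(is_offline({"HF_HUB_OFFLINE": "yes"}))      # True (fallback path)
print(is_offline({"HF_DATASETS_OFFLINE": "0"}))   # False, even if hub is offline
print(is_offline({}))                             # False
```

Note that an explicit HF_DATASETS_OFFLINE=0 overrides HF_HUB_OFFLINE, matching the "falls back only when unset" behavior described in the Credentials table.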

Compatibility Notes

  • Python 3.14 support is declared in setup.py classifiers but some optional test dependencies (numba, joblibspark, lz4, torchcodec) do not yet have Python 3.14 wheels. The core environment is fully compatible with 3.14.
  • numpy 2.x compatibility: The core environment specifies numpy>=1.17 with no upper bound, so numpy 2.x is supported. However, some optional test dependencies (faiss-cpu, tensorflow) are incompatible with numpy 2.x and are excluded from the tests_numpy2 extras group.
  • PyArrow 21.0.0 introduced content-defined chunking for Parquet, which is used by Datasets for deterministic shard boundaries. Earlier PyArrow versions will not work.
  • fsspec upper bound (<=2025.10.0) exists to prevent breakage from future fsspec API changes; this bound is periodically raised as new fsspec versions are validated.
  • huggingface-hub < 2.0 upper bound guards against potential breaking API changes in a future major release.
  • Framework auto-detection: By default, USE_TF, USE_TORCH, and USE_JAX are all set to "AUTO". The library will detect installed frameworks at import time. Setting one to "1" or "TRUE" disables auto-detection of the others (e.g., setting USE_TF=1 disables PyTorch detection).
  • License: The library is released under the Apache 2.0 license.
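The AUTO framework-selection behavior can be approximated with importlib.util.find_spec, which checks importability without importing. This is a simplified per-framework sketch (the real config.py logic also handles cross-framework interactions, such as an explicit USE_TF=1 suppressing PyTorch detection, which this sketch omits; the function name is illustrative):

```python
import importlib.util
import os

def framework_enabled(module_name, env_var, env=os.environ):
    """Sketch of AUTO/1/0 framework selection: an explicit falsy value
    disables the framework, an explicit truthy value forces it, and
    AUTO (the default) checks whether the package is importable."""
    setting = env.get(env_var, "AUTO").upper()
    if setting in ("0", "OFF", "NO", "FALSE"):
        return False
    if setting in ("1", "ON", "YES", "TRUE"):
        return True
    # AUTO: detect availability via the import system, without importing.
    return importlib.util.find_spec(module_name) is not None

# With no env override, availability is detected from the import system:
print(framework_enabled("torch", "USE_TORCH", env={}))
```

Using find_spec keeps detection cheap at import time: no framework is actually loaded until the user requests framework-formatted outputs.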

Related Pages

The following implementation pages depend on this core environment:
