Environment: Huggingface Datasets Python PyArrow Core
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Processing |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
This is the core runtime environment for the HuggingFace Datasets library, defining the minimum Python version, PyArrow backend, and all required system-level dependencies needed to load, process, and cache datasets.
Description
The HuggingFace Datasets library requires Python >= 3.10.0 and uses Apache Arrow (via PyArrow >= 21.0.0) as its columnar in-memory backend for high-performance data serialization and processing. The core environment encompasses all mandatory dependencies declared in setup.py under REQUIRED_PKGS, along with environment variables defined in src/datasets/config.py that control caching behavior, offline mode, in-memory limits, progress bar display, and ML framework selection.
Key architectural decisions reflected in this environment include:
- PyArrow >= 21.0.0 is required to support `use_content_defined_chunking` in ParquetWriter, which enables deterministic shard boundaries during dataset preparation.
- dill >= 0.3.0, < 0.4.1 is temporarily pinned because dill does not yet have official support for deterministic serialization (see dill#19).
- multiprocess < 0.70.19 is pinned to align with the dill version constraint, as multiprocess bundles dill internally.
- fsspec[http] >= 2023.1.0 is the minimum version that supports `protocol=kwargs` in fsspec's `open`, `get_fs_token_paths`, and related methods (see fsspec#1143).
Usage
Reference this environment page whenever you need to understand or reproduce the baseline runtime for any HuggingFace Datasets operation. All implementation pages in this wiki depend on this core environment. If you are working with optional extras (audio, vision, TensorFlow, PyTorch, JAX), consult the relevant optional environment pages in addition to this one.
System Requirements
| Component | Requirement | Notes |
|---|---|---|
| Python | >= 3.10.0 | Supports 3.10, 3.11, 3.12, 3.13, 3.14 per setup.py classifiers |
| Operating System | Linux, macOS, Windows | `Operating System :: OS Independent` classifier in setup.py |
| Disk Space | Varies | Datasets are cached to `~/.cache/huggingface/datasets` by default; large datasets may require significant disk space |
| Memory | Varies | In-memory mode is disabled by default (`HF_DATASETS_IN_MEMORY_MAX_SIZE=0`); Arrow memory-mapped files are used instead |
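The memory-mapping behavior noted above can be illustrated with a standard-library sketch (this is not the Datasets implementation; Arrow applies the same operating-system mechanism to cached dataset files): the OS pages data in on demand, so reading a slice does not load the whole file into RAM.

```python
import mmap
import os
import tempfile

# Write a ~1 MB demo file to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

# Memory-map it read-only: slicing touches only the needed pages,
# not the entire file
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[:4]
    size = len(mm)
    mm.close()

print(first, size)
```

This is why `HF_DATASETS_IN_MEMORY_MAX_SIZE=0` keeps memory usage roughly flat even for datasets much larger than RAM.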
Dependencies
System Packages
No system-level packages beyond Python itself are required for the core environment. All dependencies are pure Python or ship pre-built wheels:
- Python >= 3.10.0 (CPython recommended)
- A working C compiler is only needed if building PyArrow from source (rare; wheels are available for all supported platforms)
Python Packages
The following are the core dependencies declared in REQUIRED_PKGS in setup.py:
| Package | Version Constraint | Purpose |
|---|---|---|
| pyarrow | >= 21.0.0 | Backend columnar storage and serialization; minimum for `use_content_defined_chunking` in ParquetWriter |
| dill | >= 0.3.0, < 0.4.1 | Smart caching of dataset processing functions; pinned pending determinism support |
| multiprocess | < 0.70.19 | Better multiprocessing; version aligned with the dill pin |
| fsspec[http] | >= 2023.1.0, <= 2025.10.0 | Filesystem abstraction for local, remote, and cloud storage; minimum for `protocol=kwargs` support |
| huggingface-hub | >= 0.25.0, < 2.0 | Interaction with the HuggingFace Hub (downloading, uploading, authentication) |
| pandas | (no version constraint) | Performance gains with Apache Arrow interchange |
| numpy | >= 1.17 | Required for `np.random.Generator`, used in Dataset shuffling |
| requests | >= 2.32.2 | HTTP downloads of datasets |
| httpx | < 1.0.0 | Alternative HTTP client |
| tqdm | >= 4.66.3 | Progress bars for downloads and data operations |
| xxhash | (no version constraint) | Fast hashing for caching |
| filelock | (no version constraint) | File locking for safe concurrent access |
| packaging | (no version constraint) | Version parsing and comparison utilities (PyPA) |
| pyyaml | >= 5.1 | Parsing YAML metadata from dataset cards |
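The bounded pins in this table can be sanity-checked against an installed environment. The following is a hypothetical helper (not part of the library) using only the standard library; a production setup would use `packaging.specifiers` instead of this naive version parser.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins from the table above: (lower bound inclusive, upper bound exclusive);
# None means unbounded on that side
PINS = {
    "pyarrow": ((21, 0, 0), None),        # >= 21.0.0
    "dill": ((0, 3, 0), (0, 4, 1)),       # >= 0.3.0, < 0.4.1
    "multiprocess": (None, (0, 70, 19)),  # < 0.70.19
}

def parse(v: str) -> tuple:
    # Naive parse: keep only the leading digits of each dotted component
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check(pkg: str) -> str:
    try:
        v = parse(version(pkg))
    except PackageNotFoundError:
        return f"{pkg}: not installed"
    lo, hi = PINS[pkg]
    ok = (lo is None or v >= lo) and (hi is None or v < hi)
    return f"{pkg}: {'OK' if ok else 'out of range'}"

for pkg in PINS:
    print(check(pkg))
```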
Credentials
No API keys or tokens are strictly required for the core environment. However, the following environment variables (defined in src/datasets/config.py) control runtime behavior:
| Variable | Default | Description |
|---|---|---|
| HF_ENDPOINT | https://huggingface.co | Base URL for the HuggingFace Hub API |
| HF_DATASETS_CACHE | ~/.cache/huggingface/datasets | Root directory for cached datasets |
| HF_HOME | ~/.cache/huggingface | Root directory for all HuggingFace caches (hub, datasets, modules) |
| XDG_CACHE_HOME | ~/.cache | XDG base directory for caches; HF_HOME defaults to $XDG_CACHE_HOME/huggingface |
| HF_DATASETS_OFFLINE | (unset) | Set to 1 to enable offline mode; falls back to HF_HUB_OFFLINE if unset |
| HF_DATASETS_IN_MEMORY_MAX_SIZE | 0 (disabled) | Maximum dataset size in bytes to keep entirely in memory; 0 means always use memory-mapped Arrow files |
| HF_DATASETS_DISABLE_PROGRESS_BARS | (unset) | Set to 1 to globally disable progress bars; set to 0 to force-enable; unset allows programmatic control |
| HF_UPDATE_DOWNLOAD_COUNTS | AUTO | Whether to update download counts on the Hub when fetching a dataset |
| USE_TF | AUTO | Framework selection: set to 1 to force TensorFlow, 0 to disable |
| USE_TORCH | AUTO | Framework selection: set to 1 to force PyTorch, 0 to disable |
| USE_JAX | AUTO | Framework selection: set to 1 to force JAX, 0 to disable |
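The cache-path precedence in the table (HF_DATASETS_CACHE over HF_HOME over XDG_CACHE_HOME) can be sketched with the standard library alone. This mirrors the resolution order; `/tmp/hf-demo` is a hypothetical path, not a real default. Note that the library reads these variables at import time, so they must be set before `import datasets`.

```python
import os

# Simulate a user overriding HF_HOME but not HF_DATASETS_CACHE
os.environ["HF_HOME"] = "/tmp/hf-demo"
os.environ.pop("HF_DATASETS_CACHE", None)

# Resolution order: XDG_CACHE_HOME -> HF_HOME -> HF_DATASETS_CACHE
xdg = os.getenv("XDG_CACHE_HOME", "~/.cache")
hf_home = os.path.expanduser(os.getenv("HF_HOME", os.path.join(xdg, "huggingface")))
datasets_cache = os.getenv("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets"))

print(datasets_cache)
```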
Quick Install
```bash
# Install the core datasets library with all required dependencies
pip install datasets

# Or install a specific version
pip install datasets==4.5.0

# Verify the installation
python -c "import datasets; print(datasets.__version__)"
```
To install from source (main branch):
```bash
pip install "datasets @ git+https://github.com/huggingface/datasets@main#egg=datasets"
```
Code Evidence
From setup.py
```python
python_requires=">=3.10.0"

REQUIRED_PKGS = [
    # For file locking
    "filelock",
    # We use numpy>=1.17 to have np.random.Generator (Dataset shuffling)
    "numpy>=1.17",
    # Backend and serialization.
    # Minimum 21.0.0 to support `use_content_defined_chunking` in ParquetWriter
    "pyarrow>=21.0.0",
    # For smart caching dataset processing
    "dill>=0.3.0,<0.4.1",  # tmp pin until dill has official support for determinism
    # For performance gains with apache arrow
    "pandas",
    # for downloading datasets over HTTPS
    "requests>=2.32.2",
    "httpx<1.0.0",
    # progress bars in downloads and data operations
    "tqdm>=4.66.3",
    # for fast hashing
    "xxhash",
    # for better multiprocessing
    "multiprocess<0.70.19",  # to align with dill<0.3.9 (see above)
    # to save datasets locally or on any filesystem
    "fsspec[http]>=2023.1.0,<=2025.10.0",
    # To get datasets from the Datasets Hub on huggingface.co
    "huggingface-hub>=0.25.0,<2.0",
    # Utilities from PyPA to e.g., compare versions
    "packaging",
    # To parse YAML metadata from dataset cards
    "pyyaml>=5.1",
]
```
From src/datasets/config.py
```python
# Hub
HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co")

# Cache location
DEFAULT_XDG_CACHE_HOME = "~/.cache"
XDG_CACHE_HOME = os.getenv("XDG_CACHE_HOME", DEFAULT_XDG_CACHE_HOME)
DEFAULT_HF_CACHE_HOME = os.path.join(XDG_CACHE_HOME, "huggingface")
HF_CACHE_HOME = os.path.expanduser(os.getenv("HF_HOME", DEFAULT_HF_CACHE_HOME))
DEFAULT_HF_DATASETS_CACHE = os.path.join(HF_CACHE_HOME, "datasets")
HF_DATASETS_CACHE = Path(os.getenv("HF_DATASETS_CACHE", DEFAULT_HF_DATASETS_CACHE))

# Offline mode
_offline = os.environ.get("HF_DATASETS_OFFLINE")
HF_HUB_OFFLINE = constants.HF_HUB_OFFLINE if _offline is None else _offline.upper() in ENV_VARS_TRUE_VALUES

# In-memory
DEFAULT_IN_MEMORY_MAX_SIZE = 0  # Disabled
IN_MEMORY_MAX_SIZE = float(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", DEFAULT_IN_MEMORY_MAX_SIZE))

# Framework selection
USE_TF = os.environ.get("USE_TF", "AUTO").upper()
USE_TORCH = os.environ.get("USE_TORCH", "AUTO").upper()
USE_JAX = os.environ.get("USE_JAX", "AUTO").upper()
```
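The offline-mode line above distinguishes between an unset variable (fall back to huggingface_hub's constant) and any explicitly truthy string. A standalone sketch of that parsing, where the truthy set is an assumption mirroring common conventions (check `src/datasets/config.py` for the authoritative `ENV_VARS_TRUE_VALUES`):

```python
from typing import Optional

# Assumed truthy strings; the real set lives in the library's config module
ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}

def is_offline(raw: Optional[str], hub_offline: bool = False) -> bool:
    # Unset -> defer to huggingface_hub's HF_HUB_OFFLINE setting;
    # set -> any truthy string (case-insensitive) enables offline mode
    if raw is None:
        return hub_offline
    return raw.upper() in ENV_VARS_TRUE_VALUES

print(is_offline(None))   # False (falls back to the hub default)
print(is_offline("1"))    # True
print(is_offline("yes"))  # True
print(is_offline("0"))    # False
```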
Common Errors
| Error | Cause | Resolution |
|---|---|---|
| `ImportError: PyArrow >= 21.0.0 must be installed` | PyArrow version too old or not installed | Run `pip install "pyarrow>=21.0.0"` |
| `ConnectionError` when loading a dataset | No internet access and offline mode not configured | Set `HF_DATASETS_OFFLINE=1` and ensure the dataset is already cached locally |
| `MemoryError` during `.map()` | Dataset too large for available RAM with in-memory mode enabled | Set `HF_DATASETS_IN_MEMORY_MAX_SIZE=0` to use memory-mapped Arrow files instead |
| dill serialization errors after upgrading | dill version outside the pinned range (>= 0.4.1) | Pin dill to `>=0.3.0,<0.4.1` as required by the library |
| fsspec `TypeError` on protocol kwarg | fsspec version below 2023.1.0 lacks `protocol=kwargs` support | Upgrade with `pip install "fsspec[http]>=2023.1.0"` |
| `FileNotFoundError` for cache directory | Custom `HF_DATASETS_CACHE` path does not exist | Create the directory or correct the environment variable path |
| multiprocess version conflict | multiprocess and dill versions misaligned | Ensure `multiprocess<0.70.19` and `dill>=0.3.0,<0.4.1` are both satisfied |
Compatibility Notes
- Python 3.14 support is declared in setup.py classifiers but some optional test dependencies (numba, joblibspark, lz4, torchcodec) do not yet have Python 3.14 wheels. The core environment is fully compatible with 3.14.
- numpy 2.x compatibility: The core environment specifies `numpy>=1.17` with no upper bound, so numpy 2.x is supported. However, some optional test dependencies (faiss-cpu, tensorflow) are incompatible with numpy 2.x and are excluded from the `tests_numpy2` extras group.
- PyArrow 21.0.0 introduced content-defined chunking for Parquet, which Datasets uses for deterministic shard boundaries. Earlier PyArrow versions will not work.
- fsspec upper bound (<= 2025.10.0) exists to prevent breakage from future fsspec API changes; this bound is periodically raised as new fsspec versions are validated.
- huggingface-hub < 2.0 upper bound guards against potential breaking API changes in a future major release.
- Framework auto-detection: By default, `USE_TF`, `USE_TORCH`, and `USE_JAX` are all set to `"AUTO"`, and the library detects installed frameworks at import time. Setting one to `"1"` or `"TRUE"` disables auto-detection of the others (e.g., setting `USE_TF=1` disables PyTorch detection).
- License: The library is released under the Apache 2.0 license.
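The AUTO detection pattern can be sketched as follows. This is a simplified illustration (the real logic in src/datasets/config.py also handles the cross-framework interaction described above; the function name and value sets here are illustrative, not the library's API):

```python
import importlib.util
import os

# Assumed truthy/falsy string sets for the flag values
TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
FALSE_VALUES = {"0", "OFF", "NO", "FALSE"}

def framework_enabled(flag_env: str, module: str) -> bool:
    flag = os.environ.get(flag_env, "AUTO").upper()
    if flag in TRUE_VALUES:
        return True   # explicitly forced on
    if flag in FALSE_VALUES:
        return False  # explicitly disabled
    # AUTO: enabled only if the module is importable
    return importlib.util.find_spec(module) is not None

os.environ["USE_TORCH"] = "0"
print(framework_enabled("USE_TORCH", "torch"))  # False
```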
Related Pages
The following implementation pages depend on this core environment:
- Huggingface_Datasets_Dataset_Map - Dataset.map() transformation pipeline
- Huggingface_Datasets_Dataset_Filter - Dataset.filter() row selection
- Huggingface_Datasets_Load_Dataset_Builder - load_dataset_builder() factory function
- Huggingface_Datasets_DatasetBuilder_Download_and_Prepare - DatasetBuilder.download_and_prepare() pipeline
- Huggingface_Datasets_ArrowReader - Arrow file reading and deserialization
- Huggingface_Datasets_DownloadManager - Download orchestration and caching
- Huggingface_Datasets_Features - Feature type definitions and schema management