Environment:Huggingface Datasets Python PyArrow Core

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Data_Processing
Last Updated 2026-02-14 19:00 GMT

Overview

This is the core runtime environment for the HuggingFace Datasets library, defining the minimum Python version, PyArrow backend, and all required system-level dependencies needed to load, process, and cache datasets.

Description

The HuggingFace Datasets library requires Python >= 3.10.0 and uses Apache Arrow (via PyArrow >= 21.0.0) as its columnar in-memory backend for high-performance data serialization and processing. The core environment encompasses all mandatory dependencies declared in setup.py under REQUIRED_PKGS, along with environment variables defined in src/datasets/config.py that control caching behavior, offline mode, in-memory limits, progress bar display, and ML framework selection.

Key architectural decisions reflected in this environment include:

  • PyArrow >= 21.0.0 is required to support use_content_defined_chunking in ParquetWriter, which enables deterministic shard boundaries during dataset preparation.
  • dill >= 0.3.0, < 0.4.1 is temporarily pinned because dill does not yet have official support for deterministic serialization (see dill#19).
  • multiprocess < 0.70.19 is pinned to align with the dill version constraint, as multiprocess bundles dill internally.
  • fsspec[http] >= 2023.1.0 is the minimum version that supports protocol=kwargs in fsspec's open, get_fs_token_paths, and related methods (see fsspec#1143).

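The dill pin matters because datasets fingerprints user-provided processing functions by serializing and hashing them to decide whether a cached result can be reused; any nondeterminism in serialization would silently invalidate caches. A minimal stdlib-only sketch of the idea (the real library uses dill plus xxhash, not hashlib; this is an illustrative stand-in):

```python
import hashlib

def fingerprint(func):
    """Hash a function's bytecode and constants as a cheap cache key.

    Stand-in for datasets' real fingerprinting, which serializes the
    whole function with dill and hashes the bytes with xxhash; here we
    hash only the code object's bytecode and constants for illustration.
    """
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

def add_one(x):
    return x + 1

def add_two(x):
    return x + 2

def add_one_again(x):
    return x + 1

# Identical function bodies yield identical fingerprints, so a cached
# result can be reused; a changed body yields a different key.
print(fingerprint(add_one) == fingerprint(add_one_again))  # True
print(fingerprint(add_one) == fingerprint(add_two))        # False
```

This is why determinism in the serializer is load-bearing: if the same function serialized to different bytes across runs, every run would recompute from scratch.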
Usage

Reference this environment page whenever you need to understand or reproduce the baseline runtime for any HuggingFace Datasets operation. All implementation pages in this wiki depend on this core environment. If you are working with optional extras (audio, vision, TensorFlow, PyTorch, JAX), consult the relevant optional environment pages in addition to this one.

System Requirements

| Component | Requirement | Notes |
|---|---|---|
| Python | >= 3.10.0 | Supports 3.10, 3.11, 3.12, 3.13, 3.14 per setup.py classifiers |
| Operating System | Linux, macOS, Windows | "Operating System :: OS Independent" classifier in setup.py |
| Disk Space | Varies | Datasets are cached to ~/.cache/huggingface/datasets by default; large datasets may require significant disk space |
| Memory | Varies | In-memory mode is disabled by default (HF_DATASETS_IN_MEMORY_MAX_SIZE=0); Arrow memory-mapped files are used instead |

Dependencies

System Packages

No system-level packages beyond Python itself are required for the core environment. All dependencies are pure Python or ship pre-built wheels:

  • Python >= 3.10.0 (CPython recommended)
  • A working C compiler is only needed if building PyArrow from source (rare; wheels are available for all supported platforms)

Python Packages

The following are the core dependencies declared in REQUIRED_PKGS in setup.py:

| Package | Version Constraint | Purpose |
|---|---|---|
| pyarrow | >= 21.0.0 | Backend columnar storage and serialization; minimum for use_content_defined_chunking in ParquetWriter |
| dill | >= 0.3.0, < 0.4.1 | Smart caching of dataset processing functions; pinned pending determinism support |
| multiprocess | < 0.70.19 | Multiprocessing with dill-based pickling; version aligned with the dill pin |
| fsspec[http] | >= 2023.1.0, <= 2025.10.0 | Filesystem abstraction for local, remote, and cloud storage; minimum for protocol=kwargs support |
| huggingface-hub | >= 0.25.0, < 2.0 | Interaction with the HuggingFace Hub (downloading, uploading, authentication) |
| pandas | (no version constraint) | Performance gains with Apache Arrow interchange |
| numpy | >= 1.17 | Required for np.random.Generator used in Dataset shuffling |
| requests | >= 2.32.2 | HTTP downloads of datasets |
| httpx | < 1.0.0 | Alternative HTTP client |
| tqdm | >= 4.66.3 | Progress bars for downloads and data operations |
| xxhash | (no version constraint) | Fast hashing for caching |
| filelock | (no version constraint) | File locking for safe concurrent access |
| packaging | (no version constraint) | Version parsing and comparison utilities (PyPA) |
| pyyaml | >= 5.1 | Parsing YAML metadata from dataset cards |
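To audit which of these core dependencies are present in an environment, the standard library's importlib.metadata can report installed distribution versions without importing the packages. A small sketch (package names are taken from the table above as they appear on PyPI; name normalization behavior can vary slightly across Python versions):

```python
from importlib.metadata import version, PackageNotFoundError

CORE_PKGS = [
    "pyarrow", "dill", "multiprocess", "fsspec", "huggingface-hub",
    "pandas", "numpy", "requests", "httpx", "tqdm", "xxhash",
    "filelock", "packaging", "pyyaml",
]

def installed_version(name):
    """Return the installed distribution version, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

# Print a simple report; compare against the constraint table above.
for name in CORE_PKGS:
    v = installed_version(name)
    print(f"{name:20s} {v or 'MISSING'}")
```

This only checks presence, not constraint satisfaction; for real constraint checking, packaging's SpecifierSet (already a core dependency) is the usual tool.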

Credentials

No API keys or tokens are strictly required for the core environment. However, the following environment variables (defined in src/datasets/config.py) control runtime behavior:

| Variable | Default | Description |
|---|---|---|
| HF_ENDPOINT | https://huggingface.co | Base URL for the HuggingFace Hub API |
| HF_DATASETS_CACHE | ~/.cache/huggingface/datasets | Root directory for cached datasets |
| HF_HOME | ~/.cache/huggingface | Root directory for all HuggingFace caches (hub, datasets, modules) |
| XDG_CACHE_HOME | ~/.cache | XDG base directory for caches; HF_HOME defaults to $XDG_CACHE_HOME/huggingface |
| HF_DATASETS_OFFLINE | (unset) | Set to 1 to enable offline mode; falls back to HF_HUB_OFFLINE if unset |
| HF_DATASETS_IN_MEMORY_MAX_SIZE | 0 (disabled) | Maximum dataset size in bytes to keep entirely in memory; 0 means always use memory-mapped Arrow files |
| HF_DATASETS_DISABLE_PROGRESS_BARS | (unset) | Set to 1 to globally disable progress bars; set to 0 to force enable; unset allows programmatic control |
| HF_UPDATE_DOWNLOAD_COUNTS | AUTO | Whether to update download counts on the Hub when fetching a dataset |
| USE_TF | AUTO | Framework selection: set to 1 to force TensorFlow, 0 to disable |
| USE_TORCH | AUTO | Framework selection: set to 1 to force PyTorch, 0 to disable |
| USE_JAX | AUTO | Framework selection: set to 1 to force JAX, 0 to disable |
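The three cache-related variables compose in a fixed order: XDG_CACHE_HOME sets the base, HF_HOME overrides the HuggingFace root, and HF_DATASETS_CACHE overrides the final datasets directory. A small stdlib sketch of that resolution order, mirroring the config.py logic quoted later on this page (the function name is illustrative):

```python
import os

def resolve_datasets_cache(env):
    """Resolve the datasets cache directory from an environment mapping,
    applying overrides in the same order as src/datasets/config.py:
    XDG_CACHE_HOME -> HF_HOME -> HF_DATASETS_CACHE."""
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = os.path.expanduser(
        env.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    )
    return env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets"))

# No overrides: the default ~/.cache/huggingface/datasets (expanded).
print(resolve_datasets_cache({}))

# HF_HOME moves every HuggingFace cache, including datasets:
print(resolve_datasets_cache({"HF_HOME": "/data/hf"}))  # /data/hf/datasets

# HF_DATASETS_CACHE overrides the datasets directory alone:
print(resolve_datasets_cache({"HF_DATASETS_CACHE": "/scratch/ds"}))  # /scratch/ds
```

Passing a plain dict rather than reading os.environ directly makes the resolution order easy to test in isolation.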

Quick Install

# Install the core datasets library with all required dependencies
pip install datasets

# Or install a specific version
pip install datasets==4.5.0

# Verify the installation
python -c "import datasets; print(datasets.__version__)"

To install from source (main branch):

pip install "datasets @ git+https://github.com/huggingface/datasets@main#egg=datasets"

Code Evidence

From setup.py

python_requires=">=3.10.0"

REQUIRED_PKGS = [
    # For file locking
    "filelock",
    # We use numpy>=1.17 to have np.random.Generator (Dataset shuffling)
    "numpy>=1.17",
    # Backend and serialization.
    # Minimum 21.0.0 to support `use_content_defined_chunking` in ParquetWriter
    "pyarrow>=21.0.0",
    # For smart caching dataset processing
    "dill>=0.3.0,<0.4.1",  # tmp pin until dill has official support for determinism
    # For performance gains with apache arrow
    "pandas",
    # for downloading datasets over HTTPS
    "requests>=2.32.2",
    "httpx<1.0.0",
    # progress bars in downloads and data operations
    "tqdm>=4.66.3",
    # for fast hashing
    "xxhash",
    # for better multiprocessing
    "multiprocess<0.70.19",  # to align with dill<0.3.9 (see above)
    # to save datasets locally or on any filesystem
    "fsspec[http]>=2023.1.0,<=2025.10.0",
    # To get datasets from the Datasets Hub on huggingface.co
    "huggingface-hub>=0.25.0,<2.0",
    # Utilities from PyPA to e.g., compare versions
    "packaging",
    # To parse YAML metadata from dataset cards
    "pyyaml>=5.1",
]

From src/datasets/config.py

# Hub
HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co")

# Cache location
DEFAULT_XDG_CACHE_HOME = "~/.cache"
XDG_CACHE_HOME = os.getenv("XDG_CACHE_HOME", DEFAULT_XDG_CACHE_HOME)
DEFAULT_HF_CACHE_HOME = os.path.join(XDG_CACHE_HOME, "huggingface")
HF_CACHE_HOME = os.path.expanduser(os.getenv("HF_HOME", DEFAULT_HF_CACHE_HOME))

DEFAULT_HF_DATASETS_CACHE = os.path.join(HF_CACHE_HOME, "datasets")
HF_DATASETS_CACHE = Path(os.getenv("HF_DATASETS_CACHE", DEFAULT_HF_DATASETS_CACHE))

# Offline mode
_offline = os.environ.get("HF_DATASETS_OFFLINE")
HF_HUB_OFFLINE = constants.HF_HUB_OFFLINE if _offline is None else _offline.upper() in ENV_VARS_TRUE_VALUES

# In-memory
DEFAULT_IN_MEMORY_MAX_SIZE = 0  # Disabled
IN_MEMORY_MAX_SIZE = float(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", DEFAULT_IN_MEMORY_MAX_SIZE))

# Framework selection
USE_TF = os.environ.get("USE_TF", "AUTO").upper()
USE_TORCH = os.environ.get("USE_TORCH", "AUTO").upper()
USE_JAX = os.environ.get("USE_JAX", "AUTO").upper()

Common Errors

| Error | Cause | Resolution |
|---|---|---|
| ImportError: PyArrow >= 21.0.0 must be installed | PyArrow version too old or not installed | Run pip install "pyarrow>=21.0.0" |
| ConnectionError when loading a dataset | No internet access and offline mode not configured | Set HF_DATASETS_OFFLINE=1 and ensure the dataset is already cached locally |
| MemoryError during .map() | Dataset too large for available RAM with in-memory mode enabled | Set HF_DATASETS_IN_MEMORY_MAX_SIZE=0 to use memory-mapped Arrow files instead |
| dill serialization errors after upgrading | dill version outside the pinned range (>= 0.4.1) | Pin dill to >=0.3.0,<0.4.1 as required by the library |
| fsspec TypeError on protocol kwarg | fsspec version below 2023.1.0 lacks protocol=kwargs support | Upgrade with pip install "fsspec[http]>=2023.1.0" |
| FileNotFoundError for cache directory | Custom HF_DATASETS_CACHE path does not exist | Create the directory or correct the environment variable path |
| multiprocess version conflict | multiprocess and dill versions misaligned | Ensure multiprocess<0.70.19 and dill>=0.3.0,<0.4.1 are both satisfied |
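The offline-mode fallback behavior (HF_DATASETS_OFFLINE wins if set, otherwise HF_HUB_OFFLINE applies) can be sketched in plain Python. The set of accepted truthy strings below mirrors the convention used by huggingface_hub's ENV_VARS_TRUE_VALUES; the exact membership here is an illustrative assumption:

```python
# Truthy strings accepted for offline-mode variables (assumed set,
# modeled on huggingface_hub's ENV_VARS_TRUE_VALUES convention).
ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}

def is_offline(env):
    """Replicate the config.py precedence: HF_DATASETS_OFFLINE wins when
    set; otherwise fall back to HF_HUB_OFFLINE."""
    value = env.get("HF_DATASETS_OFFLINE")
    if value is None:
        return env.get("HF_HUB_OFFLINE", "").upper() in ENV_VARS_TRUE_VALUES
    return value.upper() in ENV_VARS_TRUE_VALUES

print(is_offline({"HF_DATASETS_OFFLINE": "1"}))   # True
print(is_offline({"HF_HUB_OFFLINE": "yes"}))      # True (fallback path)
print(is_offline({"HF_DATASETS_OFFLINE": "0"}))   # False, even if hub is offline
print(is_offline({}))                             # False
```

Note that an explicit HF_DATASETS_OFFLINE=0 overrides HF_HUB_OFFLINE, matching the "falls back only when unset" behavior described in the Credentials table.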

Compatibility Notes

  • Python 3.14 support is declared in setup.py classifiers but some optional test dependencies (numba, joblibspark, lz4, torchcodec) do not yet have Python 3.14 wheels. The core environment is fully compatible with 3.14.
  • numpy 2.x compatibility: The core environment specifies numpy>=1.17 with no upper bound, so numpy 2.x is supported. However, some optional test dependencies (faiss-cpu, tensorflow) are incompatible with numpy 2.x and are excluded from the tests_numpy2 extras group.
  • PyArrow 21.0.0 introduced content-defined chunking for Parquet, which is used by Datasets for deterministic shard boundaries. Earlier PyArrow versions will not work.
  • fsspec upper bound (<=2025.10.0) exists to prevent breakage from future fsspec API changes; this bound is periodically raised as new fsspec versions are validated.
  • huggingface-hub < 2.0 upper bound guards against potential breaking API changes in a future major release.
  • Framework auto-detection: By default, USE_TF, USE_TORCH, and USE_JAX are all set to "AUTO". The library will detect installed frameworks at import time. Setting one to "1" or "TRUE" disables auto-detection of the others (e.g., setting USE_TF=1 disables PyTorch detection).
  • License: The library is released under the Apache 2.0 license.
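The AUTO framework-selection behavior can be approximated with importlib.util.find_spec, which checks importability without importing. This is a simplified per-framework sketch (the real config.py logic also handles cross-framework interactions, such as an explicit USE_TF=1 suppressing PyTorch detection, which this sketch omits; the function name is illustrative):

```python
import importlib.util
import os

def framework_enabled(module_name, env_var, env=os.environ):
    """Sketch of AUTO/1/0 framework selection: an explicit falsy value
    disables the framework, an explicit truthy value forces it, and
    AUTO (the default) checks whether the package is importable."""
    setting = env.get(env_var, "AUTO").upper()
    if setting in ("0", "OFF", "NO", "FALSE"):
        return False
    if setting in ("1", "ON", "YES", "TRUE"):
        return True
    # AUTO: detect availability via the import system, without importing.
    return importlib.util.find_spec(module_name) is not None

# With no env override, availability is detected from the import system:
print(framework_enabled("torch", "USE_TORCH", env={}))
```

Using find_spec keeps detection cheap at import time: no framework is actually loaded until the user requests framework-formatted outputs.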

Related Pages

The following implementation pages depend on this core environment:
