Environment:Huggingface Datasets TensorFlow Integration
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Data_Processing |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
TensorFlow integration in HuggingFace Datasets enables the library to produce native tf.Tensor outputs and to convert datasets into tf.data.Dataset pipelines suitable for Keras training loops. Detection happens at import time: importlib.util.find_spec checks that the tensorflow module is importable, and the installed version is then resolved from package metadata across a broad list of TensorFlow distribution variants. The entire integration is gated behind a minimum major version of TensorFlow 2.
Description
At startup the library reads the USE_TF environment variable (default: "AUTO"). When the value is in the auto-or-true set and PyTorch has not been explicitly forced via USE_TORCH, the detection routine runs importlib.util.find_spec("tensorflow"). If the spec is found, it iterates over multiple known package names to resolve the installed version:
- tensorflow
- tensorflow-cpu
- tensorflow-gpu
- tf-nightly
- tf-nightly-cpu
- tf-nightly-gpu
- intel-tensorflow
- tensorflow-rocm
- tensorflow-macos
If none of these packages provides valid metadata, TF_AVAILABLE is set to False. If a version is found but its major version is less than 2, the integration is also disabled and an informational message is logged.
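The two-step detection and the version gate described above can be approximated with the standard library alone. The following is a minimal sketch, not the library's own code: the names TF_DISTRIBUTIONS, find_tf_distribution, and meets_minimum_major are hypothetical, and the major-version check uses a plain string split, whereas the excerpt under Code Evidence uses version.parse.

```python
import importlib.metadata
import importlib.util

# Distribution names listed in the text above; any of them can provide the
# importable "tensorflow" module.
TF_DISTRIBUTIONS = [
    "tensorflow", "tensorflow-cpu", "tensorflow-gpu",
    "tf-nightly", "tf-nightly-cpu", "tf-nightly-gpu",
    "intel-tensorflow", "tensorflow-rocm", "tensorflow-macos",
]


def find_tf_distribution(candidates=TF_DISTRIBUTIONS):
    """Return (distribution name, version string), or None if nothing matches."""
    # Step 1: is a module called "tensorflow" importable at all?
    if importlib.util.find_spec("tensorflow") is None:
        return None
    # Step 2: take the first candidate distribution that has installed metadata.
    for name in candidates:
        try:
            return name, importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            continue
    return None


def meets_minimum_major(version_string, minimum=2):
    """Compare only the leading (major) component of a version string."""
    return int(version_string.split(".")[0]) >= minimum


found = find_tf_distribution()
if found is None:
    print("TensorFlow not detected")
else:
    name, ver = found
    print(f"{name} {ver}, usable: {meets_minimum_major(ver)}")
```

Running this in the same environment as datasets can help explain why config.TF_AVAILABLE is False, for example when the module is importable but was installed under a distribution name that is not in the list.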
When TensorFlow is available, the TFFormatter class is registered under the format type "tensorflow" with the alias "tf". When TensorFlow is unavailable, a placeholder is registered instead; it raises ValueError("Tensorflow needs to be installed to be able to return Tensorflow tensors.") when the format is actually requested.
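The placeholder pattern defers the error from import time to the moment the format is requested. A minimal sketch of that idea follows, assuming a simple dictionary-based registry. The function and variable names here are illustrative and do not reproduce the internals of datasets.formatting.

```python
_formatters = {}    # canonical name -> formatter class
_unavailable = {}   # canonical name -> exception to raise on use
_aliases = {}       # alias -> canonical name


def register_formatter(cls, name, aliases=()):
    """Make a working formatter reachable by its name and aliases."""
    _formatters[name] = cls
    for alias in aliases:
        _aliases[alias] = name


def register_unavailable_formatter(error, name, aliases=()):
    """Remember an error to raise later instead of failing at import time."""
    _unavailable[name] = error
    for alias in aliases:
        _aliases[alias] = name


def get_formatter(format_type):
    """Resolve aliases, then return a formatter instance or raise the stored error."""
    name = _aliases.get(format_type, format_type)
    if name in _formatters:
        return _formatters[name]()
    if name in _unavailable:
        # The error only surfaces when someone actually asks for the format.
        raise _unavailable[name]
    raise ValueError(f"Unknown format type: {format_type!r}")


# Simulate an environment without TensorFlow.
register_unavailable_formatter(
    ValueError("Tensorflow needs to be installed to be able to return Tensorflow tensors."),
    "tensorflow",
    aliases=["tf"],
)

try:
    get_formatter("tf")
except ValueError as err:
    print(err)
```

The benefit of this design is that importing the library never fails because an optional backend is missing; only code paths that need the backend see the error.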
The to_tf_dataset() method on Dataset converts a HuggingFace dataset into a tf.data.Dataset that can be passed directly to model.fit() or model.predict(). It supports batching, shuffling, custom collation, label column separation, prefetching, and multi-worker loading. A runtime check detects TPU strategies and emits a warning that the generator-based loading approach is not compatible with remote TPU connections.
Usage
Set the dataset format to TensorFlow tensors:
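from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
# input_ids and attention_mask only exist after the dataset has been tokenized
# (for example with dataset.map and a tokenizer); the raw MRPC split does not contain them.
dataset.set_format(type="tensorflow", columns=["input_ids", "attention_mask", "label"])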
Convert directly to a tf.data.Dataset for Keras:
tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["label"],
    batch_size=16,
    shuffle=True,
)
# model is assumed to be an already compiled Keras model
model.fit(tf_dataset, epochs=3)
System Requirements
- Python: 3.9+ (TensorFlow constraint). For Python < 3.10, TensorFlow >= 2.6.0 is required. For Python >= 3.10, TensorFlow >= 2.16.0 is required.
- Operating System: Linux or macOS. TensorFlow tests in this repository explicitly exclude Windows (sys_platform != 'win32').
- Python upper bound: Python >= 3.14 is excluded from TensorFlow test dependencies.
- NumPy: TensorFlow is listed in NUMPY2_INCOMPATIBLE_LIBRARIES, meaning it is excluded from NumPy 2 test runs.
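The platform and Python-version constraints above are expressed as environment markers on the test dependencies. As a rough illustration, the sketch below evaluates the same two conditions in plain Python. The helper name tf_test_deps_apply is hypothetical, and the function only mirrors the markers quoted in this section (sys_platform != 'win32' and Python below 3.14); it does not read the project's setup.py.

```python
import sys


def tf_test_deps_apply(platform=sys.platform, version_info=sys.version_info[:2]):
    """Return True when both markers described above would be satisfied."""
    not_windows = platform != "win32"
    below_py314 = tuple(version_info) < (3, 14)
    return not_windows and below_py314


# Check the current interpreter, then a few explicit combinations.
print("current environment:", tf_test_deps_apply())
for platform, py in [("linux", (3, 12)), ("win32", (3, 12)), ("darwin", (3, 14))]:
    print(platform, py, tf_test_deps_apply(platform, py))
```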
Dependencies
| Dependency | Version Constraint | Notes |
|---|---|---|
| tensorflow | >=2.6.0 | Core extra; also accepts tensorflow-cpu, tensorflow-gpu, and other variants |
| protobuf | <4.0.0 | Required for compatibility with TensorFlow < 2.12 in test environments |
Install via the extras:
pip install datasets[tensorflow]
Or for the GPU variant:
pip install datasets[tensorflow_gpu]
Credentials
No credentials are required for TensorFlow integration itself. Standard HuggingFace Hub authentication (token-based) is used when downloading datasets from the Hub but is independent of the TensorFlow environment.
Quick Install
pip install datasets[tensorflow]
To verify the integration is active:
from datasets import config
print(f"TF available: {config.TF_AVAILABLE}")
print(f"TF version: {config.TF_VERSION}")
Code Evidence
Environment variable and detection logic (from src/datasets/config.py lines 42, 81-114):
USE_TF = os.environ.get("USE_TF", "AUTO").upper()

TF_VERSION = "N/A"
TF_AVAILABLE = False
if USE_TF in ENV_VARS_TRUE_AND_AUTO_VALUES and USE_TORCH not in ENV_VARS_TRUE_VALUES:
    TF_AVAILABLE = importlib.util.find_spec("tensorflow") is not None
    if TF_AVAILABLE:
        for package in [
            "tensorflow",
            "tensorflow-cpu",
            "tensorflow-gpu",
            "tf-nightly",
            "tf-nightly-cpu",
            "tf-nightly-gpu",
            "intel-tensorflow",
            "tensorflow-rocm",
            "tensorflow-macos",
        ]:
            try:
                TF_VERSION = version.parse(importlib.metadata.version(package))
            except importlib.metadata.PackageNotFoundError:
                continue
            else:
                break
        else:
            TF_AVAILABLE = False
    if TF_AVAILABLE:
        if TF_VERSION.major < 2:
            logger.info(f"TensorFlow found but with version {TF_VERSION}. `datasets` requires version 2 minimum.")
            TF_AVAILABLE = False
Formatter registration (from src/datasets/formatting/__init__.py lines 98-104):
if config.TF_AVAILABLE:
    from .tf_formatter import TFFormatter

    _register_formatter(TFFormatter, "tensorflow", aliases=["tf"])
else:
    _tf_error = ValueError("Tensorflow needs to be installed to be able to return Tensorflow tensors.")
    _register_unavailable_formatter(_tf_error, "tensorflow", aliases=["tf"])
TPU strategy warning (from src/datasets/arrow_dataset.py lines 415-421):
if isinstance(tf.distribute.get_strategy(), tf.distribute.TPUStrategy):
    logger.warning(
        "Note that to_tf_dataset() loads the data with a generator rather than a full tf.data "
        "pipeline and is not compatible with remote TPU connections. If you encounter errors, please "
        "try using a TPU VM or, if your data can fit in memory, loading it into memory as a dict of "
        "Tensors instead of streaming with to_tf_dataset()."
    )
Extras in setup.py (from setup.py lines 217-220):
"tensorflow": [
    "tensorflow>=2.6.0",
],
"tensorflow_gpu": ["tensorflow>=2.6.0"],
Common Errors
| Error | Cause | Resolution |
|---|---|---|
| ValueError: Tensorflow needs to be installed to be able to return Tensorflow tensors. | set_format(type="tensorflow") called when TensorFlow is not installed or detected | Install TensorFlow: pip install "tensorflow>=2.6.0" |
| ImportError: Called a Tensorflow-specific function but Tensorflow is not installed. | to_tf_dataset() called when config.TF_AVAILABLE is False | Install TensorFlow or check that USE_TF is not set to a false value |
| Log message: TensorFlow found but with version {X}. datasets requires version 2 minimum. | TensorFlow 1.x is installed | Upgrade to TensorFlow >= 2.6.0 |
| TPU warning: "not compatible with remote TPU connections" | to_tf_dataset() used under a TPUStrategy | Use a TPU VM instead of a remote TPU connection, or load data into memory as a dict of tensors |
| TF disabled because USE_TORCH is set to a true value | Mutual exclusion logic in config.py: when USE_TORCH is explicitly true, TF detection is skipped | Unset USE_TORCH or set it to AUTO. Per the detection condition, setting USE_TF=1 alone does not re-enable TensorFlow while USE_TORCH is true |
Compatibility Notes
- Mutual exclusion with PyTorch: When USE_TORCH is explicitly set to a true value, TensorFlow detection is skipped entirely. Conversely, when USE_TF is explicitly true, PyTorch is disabled. In AUTO mode both can coexist.
- Windows: TensorFlow test dependencies carry the marker sys_platform != 'win32', indicating that TensorFlow integration is not tested or officially supported on Windows within this project.
- NumPy 2: TensorFlow is listed in NUMPY2_INCOMPATIBLE_LIBRARIES and is excluded from the tests_numpy2 extras. Environments using NumPy >= 2.0 should not expect TensorFlow compatibility.
- Python 3.14: TensorFlow test dependencies are restricted to python_version < '3.14'.
- protobuf: Test environments pin protobuf<4.0.0 because protobuf 4.x breaks compatibility with TensorFlow versions prior to 2.12.
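The mutual exclusion between USE_TF and USE_TORCH is easiest to see as a small truth table. The sketch below models it as a pure function. It rests on two assumptions that are not shown in the excerpts on this page: that the true-value set is {"1", "ON", "YES", "TRUE"}, and that the PyTorch condition is the mirror image of the TensorFlow one. The name detection_enabled is hypothetical.

```python
# Assumed contents of the true-value set; the excerpt only names the constants.
TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
TRUE_AND_AUTO_VALUES = TRUE_VALUES | {"AUTO"}


def detection_enabled(use_tf="AUTO", use_torch="AUTO"):
    """Report which framework detection routines would run for given settings."""
    use_tf, use_torch = use_tf.upper(), use_torch.upper()
    # Same shape as the condition in the config.py excerpt.
    tf_enabled = use_tf in TRUE_AND_AUTO_VALUES and use_torch not in TRUE_VALUES
    # Assumed mirror-image condition for PyTorch.
    torch_enabled = use_torch in TRUE_AND_AUTO_VALUES and use_tf not in TRUE_VALUES
    return {"tensorflow": tf_enabled, "torch": torch_enabled}


for use_tf, use_torch in [("AUTO", "AUTO"), ("AUTO", "1"), ("1", "AUTO"), ("1", "1")]:
    print(f"USE_TF={use_tf} USE_TORCH={use_torch} ->", detection_enabled(use_tf, use_torch))
```

Under these assumptions, forcing both variables to a true value leaves neither framework detected, which is why clearing one of them is the safer fix.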
Related Pages
- Huggingface_Datasets_TFFormatter -- The TFFormatter class that converts Arrow tables to TensorFlow tensors
- Huggingface_Datasets_Dataset_To_Tf_Dataset -- The to_tf_dataset() method for creating tf.data.Dataset pipelines
- Huggingface_Datasets_Dataset_Set_Format -- The set_format() method used to select the TensorFlow output format