
Environment:Huggingface Datasets TensorFlow Integration

From Leeroopedia
Knowledge Sources
Domains Deep_Learning, Data_Processing
Last Updated 2026-02-14 19:00 GMT

Overview

TensorFlow integration in HuggingFace Datasets enables the library to produce native tf.Tensor outputs and to convert datasets into tf.data.Dataset pipelines suitable for Keras training loops. Detection is performed at import time through importlib.util.find_spec against a broad list of TensorFlow package variants, and the entire integration is gated behind a minimum major version requirement of TensorFlow 2.

Description

At startup the library reads the USE_TF environment variable (default: "AUTO"). When the value is in the auto-or-true set and PyTorch has not been explicitly forced via USE_TORCH, the detection routine runs importlib.util.find_spec("tensorflow"). If the spec is found, it iterates over multiple known package names to resolve the installed version:

  • tensorflow
  • tensorflow-cpu
  • tensorflow-gpu
  • tf-nightly
  • tf-nightly-cpu
  • tf-nightly-gpu
  • intel-tensorflow
  • tensorflow-rocm
  • tensorflow-macos

If none of these packages provide valid metadata, TF_AVAILABLE is set to False. If a version is found but its major version is less than 2, the integration is also disabled with an informational log message.
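
The gate described above can be sketched as a small standalone function. The value sets mirror ENV_VARS_TRUE_VALUES and ENV_VARS_TRUE_AND_AUTO_VALUES in config.py; the function itself is an illustration of the logic, not the library's code:

```python
# Illustrative re-implementation of the TensorFlow availability gate.
TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
TRUE_AND_AUTO_VALUES = TRUE_VALUES | {"AUTO"}

def tf_integration_enabled(use_tf, use_torch, tf_version):
    """Return True when the TensorFlow integration would be activated.

    use_tf / use_torch mimic the USE_TF and USE_TORCH environment variables;
    tf_version is the resolved package version string, or None when no
    TensorFlow package provided metadata.
    """
    if use_tf.upper() not in TRUE_AND_AUTO_VALUES:
        return False  # USE_TF explicitly set to a false value
    if use_torch.upper() in TRUE_VALUES:
        return False  # PyTorch explicitly forced: TF detection is skipped
    if tf_version is None:
        return False  # no TensorFlow package found
    major = int(tf_version.split(".")[0])
    return major >= 2  # TF 1.x is rejected with an informational log
```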

When TensorFlow is available, the TFFormatter class is registered under the format type "tensorflow" with the alias "tf". When TensorFlow is unavailable, a placeholder is registered instead, which raises ValueError("Tensorflow needs to be installed to be able to return Tensorflow tensors.") on use.
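
The register-or-placeholder pattern can be sketched with a toy registry. All names below are hypothetical illustrations, not the library's internals:

```python
# Toy registry illustrating the register-or-placeholder pattern described above.
_formatters = {}           # format type -> formatter object
_unavailable_errors = {}   # format type -> error to raise on use

def register_formatter(formatter, format_type, aliases=()):
    for name in (format_type, *aliases):
        _formatters[name] = formatter

def register_unavailable_formatter(error, format_type, aliases=()):
    for name in (format_type, *aliases):
        _unavailable_errors[name] = error

def get_formatter(format_type):
    if format_type in _formatters:
        return _formatters[format_type]
    if format_type in _unavailable_errors:
        raise _unavailable_errors[format_type]  # e.g. the ValueError above
    raise ValueError(f"Unknown format type: {format_type!r}")
```

The placeholder approach keeps error reporting lazy: importing the library never fails, and the missing-dependency error surfaces only when a TensorFlow format is actually requested.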

The to_tf_dataset() method on Dataset converts a HuggingFace dataset into a tf.data.Dataset that can be passed directly to model.fit() or model.predict(). It supports batching, shuffling, custom collation, label column separation, prefetching, and multi-worker loading. A runtime check detects TPU strategies and emits a warning that the generator-based loading approach is not compatible with remote TPU connections.

Usage

Set the dataset format to TensorFlow tensors. The raw MRPC split contains only sentence1, sentence2, label, and idx columns, so the example first tokenizes it (here with a transformers tokenizer) to produce input_ids and attention_mask:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("glue", "mrpc", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = dataset.map(lambda b: tokenizer(b["sentence1"], b["sentence2"], truncation=True, padding="max_length"), batched=True)
dataset.set_format(type="tensorflow", columns=["input_ids", "attention_mask", "label"])

Convert directly to a tf.data.Dataset for Keras:

tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["label"],
    batch_size=16,
    shuffle=True,
)
# `model` is assumed to be a compiled Keras model whose inputs match the feature columns
model.fit(tf_dataset, epochs=3)

System Requirements

  • Python: 3.9+ (TensorFlow constraint). For Python < 3.10, TensorFlow >= 2.6.0 is required. For Python >= 3.10, TensorFlow >= 2.16.0 is required.
  • Operating System: Linux or macOS. TensorFlow tests in this repository explicitly exclude Windows (sys_platform != 'win32').
  • Python upper bound: Python >= 3.14 is excluded from TensorFlow test dependencies.
  • NumPy: TensorFlow is listed in NUMPY2_INCOMPATIBLE_LIBRARIES, meaning it is excluded from NumPy 2 test runs.
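
The Python/TensorFlow pairing above can be expressed as a small helper. This is an illustration of the stated constraints, not a function the library provides:

```python
import sys

def min_tf_version(py_version=None):
    """Minimum TensorFlow version for a (major, minor) Python version,
    per the constraints listed above. Returns None for Python >= 3.14,
    which is excluded from the TensorFlow test dependencies."""
    major, minor = py_version or sys.version_info[:2]
    if (major, minor) >= (3, 14):
        return None
    if (major, minor) >= (3, 10):
        return "2.16.0"
    return "2.6.0"
```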

Dependencies

Dependency | Version Constraint | Notes
tensorflow | >=2.6.0 | Core extra; the detection logic also accepts tensorflow-cpu, tensorflow-gpu, and other variants
protobuf | <4.0.0 | Required for compatibility with TensorFlow < 2.12 in test environments

Install via the extras:

pip install "datasets[tensorflow]"

Or for the GPU variant:

pip install "datasets[tensorflow_gpu]"

Credentials

No credentials are required for TensorFlow integration itself. Standard HuggingFace Hub authentication (token-based) is used when downloading datasets from the Hub but is independent of the TensorFlow environment.

Quick Install

pip install "datasets[tensorflow]"

To verify the integration is active:

from datasets import config
print(f"TF available: {config.TF_AVAILABLE}")
print(f"TF version:   {config.TF_VERSION}")

Code Evidence

Environment variable and detection logic (from src/datasets/config.py lines 42, 81-114):

USE_TF = os.environ.get("USE_TF", "AUTO").upper()

TF_VERSION = "N/A"
TF_AVAILABLE = False

if USE_TF in ENV_VARS_TRUE_AND_AUTO_VALUES and USE_TORCH not in ENV_VARS_TRUE_VALUES:
    TF_AVAILABLE = importlib.util.find_spec("tensorflow") is not None
    if TF_AVAILABLE:
        for package in [
            "tensorflow",
            "tensorflow-cpu",
            "tensorflow-gpu",
            "tf-nightly",
            "tf-nightly-cpu",
            "tf-nightly-gpu",
            "intel-tensorflow",
            "tensorflow-rocm",
            "tensorflow-macos",
        ]:
            try:
                TF_VERSION = version.parse(importlib.metadata.version(package))
            except importlib.metadata.PackageNotFoundError:
                continue
            else:
                break
        else:
            TF_AVAILABLE = False
    if TF_AVAILABLE:
        if TF_VERSION.major < 2:
            logger.info(f"TensorFlow found but with version {TF_VERSION}. `datasets` requires version 2 minimum.")
            TF_AVAILABLE = False
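
Note that the for ... else construct above is load-bearing and easy to misread: the else branch runs only when the loop finishes without hitting break, i.e. when none of the candidate packages provided metadata. A stripped-down illustration (hypothetical helper, not library code):

```python
def first_found(candidates, installed):
    """Mimic the loop above: return the first candidate present in `installed`.

    The `else` clause of the for loop fires only when no `break` occurred,
    which is exactly the "no package provided metadata" case.
    """
    for name in candidates:
        if name in installed:
            found = name
            break
    else:
        found = None
    return found
```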

Formatter registration (from src/datasets/formatting/__init__.py lines 98-104):

if config.TF_AVAILABLE:
    from .tf_formatter import TFFormatter
    _register_formatter(TFFormatter, "tensorflow", aliases=["tf"])
else:
    _tf_error = ValueError("Tensorflow needs to be installed to be able to return Tensorflow tensors.")
    _register_unavailable_formatter(_tf_error, "tensorflow", aliases=["tf"])

TPU strategy warning (from src/datasets/arrow_dataset.py lines 415-421):

if isinstance(tf.distribute.get_strategy(), tf.distribute.TPUStrategy):
    logger.warning(
        "Note that to_tf_dataset() loads the data with a generator rather than a full tf.data "
        "pipeline and is not compatible with remote TPU connections. If you encounter errors, please "
        "try using a TPU VM or, if your data can fit in memory, loading it into memory as a dict of "
        "Tensors instead of streaming with to_tf_dataset()."
    )

Extras in setup.py (from setup.py lines 217-220):

"tensorflow": [
    "tensorflow>=2.6.0",
],
"tensorflow_gpu": ["tensorflow>=2.6.0"],

Common Errors

  • ValueError: Tensorflow needs to be installed to be able to return Tensorflow tensors.
    Cause: set_format(type="tensorflow") was called when TensorFlow is not installed or detected.
    Resolution: Install TensorFlow: pip install "tensorflow>=2.6.0"
  • ImportError: Called a Tensorflow-specific function but Tensorflow is not installed.
    Cause: to_tf_dataset() was called while config.TF_AVAILABLE is False.
    Resolution: Install TensorFlow, or check that USE_TF is not set to a false value.
  • "TensorFlow found but with version {X}. `datasets` requires version 2 minimum."
    Cause: TensorFlow 1.x is installed.
    Resolution: Upgrade to TensorFlow >= 2.6.0.
  • TPU warning: "not compatible with remote TPU connections"
    Cause: to_tf_dataset() was used under a TPUStrategy.
    Resolution: Use a TPU VM instead of a remote TPU connection, or load the data into memory as a dict of tensors.
  • TensorFlow disabled because USE_TORCH is set to a true value.
    Cause: Mutual-exclusion logic in config.py skips TF detection when USE_TORCH is explicitly true.
    Resolution: Set USE_TF=1 explicitly to override, or unset USE_TORCH.

Compatibility Notes

  • Mutual exclusion with PyTorch: When USE_TORCH is explicitly set to a true value, TensorFlow detection is skipped entirely. Conversely, when USE_TF is explicitly true, PyTorch is disabled. In AUTO mode both can coexist.
  • Windows: TensorFlow test dependencies carry the marker sys_platform != 'win32', indicating that TensorFlow integration is not tested or officially supported on Windows within this project.
  • NumPy 2: TensorFlow is listed in NUMPY2_INCOMPATIBLE_LIBRARIES and is excluded from the tests_numpy2 extras, so this project's test matrix does not exercise TensorFlow against NumPy >= 2.0.
  • Python 3.14: TensorFlow test dependencies are restricted to python_version < '3.14'.
  • protobuf: Test environments pin protobuf<4.0.0 because protobuf 4.x breaks compatibility with TensorFlow versions prior to 2.12.
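
Taken together, the constraints above correspond to a test-environment requirements fragment along these lines. The exact pins and markers are an assumption reconstructed from the notes listed here, not a file from the repository:

```
tensorflow>=2.6.0,<2.12 ; sys_platform != 'win32' and python_version < '3.14'
protobuf<4.0.0
numpy<2.0
```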
