Principle:Huggingface Datasets TensorFlow Formatting

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

TensorFlow Formatting is the principle of converting Arrow table data to TensorFlow tensors for use in TensorFlow/Keras training and inference pipelines.

Description

When a dataset's format is set to "tensorflow" (for example through set_format or with_format), the TensorFlow Formatting principle governs how Arrow columns are converted to tf.Tensor objects. The conversion extracts NumPy arrays from Arrow, applies dtype defaults (tf.int64 for integers, tf.float32 for floats), and calls tf.convert_to_tensor() to produce the final tensors. Three kinds of values are handled specially: PIL images are converted to NumPy arrays first, None values are passed through unchanged, and video/audio decoder objects are returned as-is. Lists of tensors that share a shape and dtype are consolidated into a single tensor via tf.stack, while lists of variable-length 1-D tensors are consolidated into a tf.RaggedTensor via tf.ragged.stack.
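The conversion steps described above can be sketched in a few lines. This is a simplified illustration, not the library's actual code: the helper names tensorize and consolidate are hypothetical, and the sketch leaves out PIL images, decoder objects, and user-supplied tensor keyword arguments.

```python
import numpy as np
import tensorflow as tf


def tensorize(value):
    """Hypothetical helper: turn one NumPy value into a tf.Tensor."""
    if value is None:
        return value  # None passes through unchanged
    kwargs = {}
    if isinstance(value, (np.number, np.ndarray)):
        if np.issubdtype(value.dtype, np.integer):
            kwargs["dtype"] = tf.int64  # default for integer data
        elif np.issubdtype(value.dtype, np.floating):
            kwargs["dtype"] = tf.float32  # default for floating-point data
    return tf.convert_to_tensor(value, **kwargs)


def consolidate(column):
    """Hypothetical helper: merge a list of tensors when their shapes allow it."""
    if isinstance(column, list) and column:
        first = column[0]
        same_shape = all(
            isinstance(x, tf.Tensor) and x.shape == first.shape and x.dtype == first.dtype
            for x in column
        )
        if same_shape:
            return tf.stack(column)  # dense batch tensor
        all_1d = all(
            isinstance(x, tf.Tensor) and x.shape.rank == 1 and x.dtype == first.dtype
            for x in column
        )
        if all_1d:
            return tf.ragged.stack(column)  # variable-length rows
    return column  # anything else is left as a plain list


# Same-length rows become one dense tensor; mixed lengths become a RaggedTensor.
dense = consolidate([tensorize(np.array([1, 2])), tensorize(np.array([3, 4]))])
ragged = consolidate([tensorize(np.array([1, 2, 3])), tensorize(np.array([4, 5]))])
```

Only 1-D tensors are ragged-stacked in this sketch, matching the rule stated above; lists of higher-rank tensors with differing shapes are returned as plain lists.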

Usage

Use TensorFlow Formatting when you are training or evaluating models with TensorFlow/Keras and want the dataset's __getitem__ to return ready-to-use TF tensors. It is complementary to the to_tf_dataset method, which creates a full tf.data.Dataset pipeline.

Theoretical Basis

TensorFlow tensors are the fundamental data structure for computation in TensorFlow. Converting from Arrow to TF tensors follows the same two-step pattern as other formatters: NumPy extraction followed by framework tensor creation. The TensorFlow formatter additionally converts PyTorch tensors (via .detach().cpu().numpy()) when both frameworks are loaded, and produces tf.RaggedTensor for variable-length sequences, which are common in NLP tasks where tokenized inputs differ in length.

Related Pages

Implemented By

Implementation:Huggingface_Datasets_TFFormatter
