Principle:Huggingface Datasets TensorFlow Formatting
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
TensorFlow Formatting is the principle of converting Arrow table data to TensorFlow tensors for use in TensorFlow/Keras training and inference pipelines.
Description
When a dataset's format is set to "tensorflow", the TensorFlow Formatting principle governs how Arrow columns are converted to tf.Tensor objects. The conversion first extracts NumPy arrays from Arrow, then applies dtype defaults (int64 for integers, float32 for floats), and finally calls tf.convert_to_tensor() to produce the tensors. Some values receive special handling: PIL images are converted to NumPy arrays before tensor creation, None values are passed through unchanged, and video/audio decoder objects are returned as-is. When a batch yields a list of tensors, same-shaped tensors are consolidated into a single tensor via tf.stack, and 1-D tensors of differing lengths are consolidated into a tf.RaggedTensor via tf.ragged.stack.
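The dtype defaults and the consolidation rules described above can be sketched as follows. This is a simplified illustration, not the library's actual code: the helper names `tensorize` and `consolidate` are hypothetical, and only the `tf.convert_to_tensor`, `tf.stack`, and `tf.ragged.stack` calls come from the description.

```python
import numpy as np
import tensorflow as tf


def tensorize(value):
    """Hypothetical helper: convert one value to a tf.Tensor with dtype defaults."""
    if value is None:
        # None is passed through unchanged.
        return value
    arr = np.asarray(value)
    if np.issubdtype(arr.dtype, np.integer):
        return tf.convert_to_tensor(arr, dtype=tf.int64)
    if np.issubdtype(arr.dtype, np.floating):
        return tf.convert_to_tensor(arr, dtype=tf.float32)
    return tf.convert_to_tensor(arr)


def consolidate(tensors):
    """Hypothetical helper: merge a list of tensors into one object when possible."""
    if not tensors or not all(isinstance(t, tf.Tensor) for t in tensors):
        return tensors
    shapes = {tuple(t.shape) for t in tensors}
    if len(shapes) == 1:
        # Same shape everywhere: a regular dense tensor.
        return tf.stack(tensors)
    if all(t.shape.rank == 1 for t in tensors):
        # 1-D tensors of differing lengths: a ragged tensor.
        return tf.ragged.stack(tensors)
    # Anything else is left as a plain list in this sketch.
    return tensors


dense = consolidate([tensorize([1, 2]), tensorize([3, 4])])
ragged = consolidate([tensorize([1.0, 2.0]), tensorize([3.0])])
```

Here `dense` is a regular int64 tensor of shape (2, 2), while `ragged` is a float32 tf.RaggedTensor because its rows differ in length.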
Usage
Use TensorFlow Formatting when you are training or evaluating models with TensorFlow/Keras and want the dataset's __getitem__ to return ready-to-use TF tensors. It is complementary to the to_tf_dataset method, which creates a full tf.data.Dataset pipeline.
Theoretical Basis
TensorFlow tensors are the fundamental data structure for computation in TensorFlow. Converting from Arrow to TF tensors follows the same two-step pattern as the other formatters: NumPy extraction followed by framework tensor creation. The TensorFlow formatter adds two behaviors on top of this pattern. First, when PyTorch is also loaded, it converts any PyTorch tensors it encounters via .detach().cpu().numpy() before creating TF tensors. Second, it produces tf.RaggedTensor for variable-length sequences, which are common in NLP tasks where tokenized inputs differ in length.