Implementation:Huggingface Datasets TFFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the Hugging Face Datasets library, for converting Arrow table data to TensorFlow tensors.
Description
TFFormatter is a formatting class that extends TensorFormatter and converts Arrow data to tf.Tensor objects. It provides three formatting methods: format_row (single example), format_column (single column), and format_batch (batch of examples). Conversion first extracts NumPy arrays from the Arrow table, then recursively tensorizes the resulting structure with tf.convert_to_tensor(), defaulting to int64 for integer data and float32 for floating-point data. Lists of tensors that share a shape and dtype are consolidated with tf.stack; lists of 1-D tensors that share a dtype but differ in length are consolidated into a tf.RaggedTensor with tf.ragged.stack. PyTorch tensors are also accepted: they are detached and converted through NumPy before tensorization.
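The tensorize-then-consolidate steps described above can be sketched with plain NumPy and TensorFlow calls. This is an illustrative approximation, not the library's code: the helper names tensorize_sketch and consolidate_sketch are hypothetical, and the real methods handle more cases (nested structures, None values, images, and so on).

```python
import numpy as np
import tensorflow as tf


def tensorize_sketch(value: np.ndarray) -> tf.Tensor:
    """Convert a NumPy array to a tensor with the default dtypes described above."""
    if np.issubdtype(value.dtype, np.integer):
        return tf.convert_to_tensor(value, dtype=tf.int64)
    if np.issubdtype(value.dtype, np.floating):
        return tf.convert_to_tensor(value, dtype=tf.float32)
    return tf.convert_to_tensor(value)


def consolidate_sketch(tensors: list):
    """Stack same-shaped tensors; use a ragged stack for 1-D tensors of differing length."""
    first = tensors[0]
    if all(t.shape == first.shape and t.dtype == first.dtype for t in tensors):
        return tf.stack(tensors)
    if all(t.shape.rank == 1 and t.dtype == first.dtype for t in tensors):
        return tf.ragged.stack(tensors)
    return tensors  # anything else stays a plain list


# Two rows of equal length become one dense tensor.
fixed = consolidate_sketch(
    [tensorize_sketch(np.array([1, 2])), tensorize_sketch(np.array([3, 4]))]
)

# Two rows of different length become a RaggedTensor.
ragged = consolidate_sketch(
    [tensorize_sketch(np.array([0.5, 1.5, 2.5])), tensorize_sketch(np.array([3.5]))]
)

print(fixed.shape, fixed.dtype)
print(type(ragged).__name__, ragged.dtype)
```

The ragged fallback is restricted to 1-D tensors in this sketch, matching the description: stacking higher-rank tensors of differing shape raggedly would make more than one dimension ragged.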
Usage
TFFormatter is typically not instantiated directly by users. It is automatically selected when Dataset.with_format("tensorflow") or Dataset.set_format("tensorflow") is called. It powers the tensor conversion layer for all TensorFlow-formatted dataset access.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/formatting/tf_formatter.py
- Lines: L32-L126
Signature
class TFFormatter(TensorFormatter[Mapping, "tf.Tensor", Mapping]):
    def __init__(self, features=None, token_per_repo_id=None, **tf_tensor_kwargs):
    def _consolidate(self, column):
    def _tensorize(self, value):
    def _recursive_tensorize(self, data_struct):
    def recursive_tensorize(self, data_struct: dict):
    def format_row(self, pa_table: pa.Table) -> Mapping:
    def format_column(self, pa_table: pa.Table) -> "tf.Tensor":
    def format_batch(self, pa_table: pa.Table) -> Mapping:
Import
from datasets.formatting.tf_formatter import TFFormatter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| features | Optional[Features] | No | Dataset features for decoding special types (e.g., Image, Audio). |
| token_per_repo_id | Optional[dict] | No | Authentication tokens for accessing private repositories. |
| **tf_tensor_kwargs | keyword arguments | No | Additional keyword arguments forwarded to tf.convert_to_tensor() (e.g., dtype). |
| pa_table | pa.Table | Yes (for format methods) | The Arrow table to convert. Passed to format_row, format_column, or format_batch. |
Outputs
| Name | Type | Description |
|---|---|---|
| row | Mapping | A dict mapping column names to tf.Tensor values (from format_row). |
| column | tf.Tensor | A single tensor for the column (from format_column). |
| batch | Mapping | A dict mapping column names to batched tf.Tensor values (from format_batch). |
Usage Examples
Basic Usage
from datasets import load_dataset
# TFFormatter is used automatically when format is "tensorflow"
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds = ds.with_format("tensorflow")
# Accessing a row returns TF tensors
row = ds[0]
print(type(row["label"])) # <class 'tensorflow.python.framework.ops.EagerTensor'>
# Accessing a batch returns stacked TF tensors
batch = ds[:8]
print(batch["label"].shape) # TensorShape([8])