Principle:Huggingface Datasets Streaming Output Format
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Configuring the output format of streamed elements controls how dataset values are represented when yielded, enabling seamless integration with framework-specific tensor types.
Description
By default, streaming datasets yield examples as Python dictionaries with native Python types (strings, integers, lists). However, deep learning frameworks and numerical libraries typically expect data in their own array or tensor types (PyTorch tensors, NumPy arrays, TensorFlow tensors, JAX arrays). The output format configuration determines how each yielded element is converted before reaching the consumer.
Key aspects of streaming output format:
- Format types: Supported formats include None (Python objects, the default), 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', and 'polars'.
- Lazy conversion: The format configuration is stored on the dataset, but the actual conversion happens during iteration. Each element is converted to the target format as it is yielded.
- Propagation through transforms: When a formatted dataset is mapped or filtered, the formatting configuration is propagated to the resulting dataset, ensuring consistent output types throughout the pipeline.
- No bulk conversion step: Formatting is applied per element at iteration time, so the dataset is never converted or materialized as a whole. Only the element currently being yielded is held in the target format, which fits the streaming goal of minimal memory usage.
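The lazy-conversion and propagation behavior described in the list above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the `FormattedStream` class, the `CONVERTERS` registry, and the `'array'` format name are hypothetical, and the standard-library `array.array` stands in for a framework tensor type. In the `datasets` library itself, the corresponding entry point is `IterableDataset.with_format`.

```python
from array import array
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Example = Dict[str, Any]


def _to_array(value: Any) -> Any:
    """Convert non-empty lists of ints or floats to array.array; pass others through."""
    if isinstance(value, list) and value and all(isinstance(v, int) for v in value):
        return array("q", value)  # signed 64-bit integers
    if isinstance(value, list) and value and all(isinstance(v, float) for v in value):
        return array("d", value)  # double-precision floats
    return value


# Hypothetical registry mapping a format name to a per-value converter.
CONVERTERS: Dict[Optional[str], Callable[[Any], Any]] = {
    None: lambda value: value,  # default: plain Python objects
    "array": _to_array,         # stand-in for 'numpy', 'torch', and so on
}


class FormattedStream:
    """Toy streaming dataset that stores a format and applies it only on iteration."""

    def __init__(self, source: Callable[[], Iterable[Example]],
                 format_type: Optional[str] = None) -> None:
        self._source = source
        self.format_type = format_type  # stored configuration; nothing is converted yet

    def with_format(self, format_type: Optional[str]) -> "FormattedStream":
        if format_type not in CONVERTERS:
            raise ValueError(f"unknown format: {format_type!r}")
        return FormattedStream(self._source, format_type)

    def map(self, fn: Callable[[Example], Example]) -> "FormattedStream":
        def mapped() -> Iterator[Example]:
            for example in self._source():
                yield fn(example)
        # The format configuration is carried over to the resulting stream.
        return FormattedStream(mapped, self.format_type)

    def __iter__(self) -> Iterator[Example]:
        convert = CONVERTERS[self.format_type]
        for example in self._source():
            # Conversion happens here, one element at a time.
            yield {key: convert(value) for key, value in example.items()}


def raw_examples() -> Iterator[Example]:
    yield {"text": "hello", "ids": [1, 2, 3]}
    yield {"text": "world", "ids": [4, 5]}


stream = FormattedStream(raw_examples).with_format("array")
doubled = stream.map(lambda ex: {**ex, "ids": [2 * i for i in ex["ids"]]})
first = next(iter(doubled))
```

In this sketch `doubled` inherits the `'array'` format from `stream`, so `first["ids"]` is an `array.array` even though the mapped function returned a plain list, while `first["text"]` stays a Python string.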
The output format is particularly important when using streaming datasets with PyTorch's DataLoader, where tensors are expected for automatic batching and device transfer.
Usage
Use streaming output format configuration when:
- You are feeding streaming data into a PyTorch, TensorFlow, or JAX training loop that expects framework-native tensors.
- You want NumPy arrays for numerical processing or analysis.
- You need Arrow tables or Pandas DataFrames for downstream data processing.
- You want to ensure consistent data types throughout a chain of lazy transformations.
Theoretical Basis
Output format configuration implements the adapter pattern: it wraps the raw data representation with a conversion layer that presents the data in the format expected by the consumer. This decouples the data source (which produces Python dictionaries) from the data consumer (which expects framework-specific types).
The conversion is applied as a final-stage transformation in the iteration pipeline. It sits between the last processing step (map, filter, etc.) and the consumer, acting as a conversion boundary between the dataset's internal representation and the types exposed to the consumer.
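The placement of the conversion as the last stage can be illustrated with a chain of plain Python generators. This is a sketch under stated assumptions, not the library's internals: the function names (`source`, `keep_even`, `add_total`, `format_adapter`, `as_tuple`) are hypothetical, and a tuple stands in for a framework tensor. The `events` log records when each element is produced and converted.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Example = Dict[str, Any]
events: List[str] = []  # records the order in which work actually happens


def source() -> Iterator[Example]:
    """Data source: yields plain Python dictionaries."""
    for i in range(3):
        events.append(f"produce {i}")
        yield {"id": i, "values": [i, i + 1]}


def keep_even(examples: Iterable[Example]) -> Iterator[Example]:
    """Filter step: keeps examples with an even id."""
    for example in examples:
        if example["id"] % 2 == 0:
            yield example


def add_total(examples: Iterable[Example]) -> Iterator[Example]:
    """Map step: adds a derived field."""
    for example in examples:
        yield {**example, "total": sum(example["values"])}


def format_adapter(examples: Iterable[Example],
                   convert: Callable[[Any], Any]) -> Iterator[Example]:
    """Final stage: converts each value just before it reaches the consumer."""
    for example in examples:
        events.append(f"convert {example['id']}")
        yield {key: convert(value) for key, value in example.items()}


def as_tuple(value: Any) -> Any:
    """Stand-in converter: lists become tuples, everything else is unchanged."""
    return tuple(value) if isinstance(value, list) else value


# source -> filter -> map -> format adapter -> consumer
pipeline = format_adapter(add_total(keep_even(source())), as_tuple)
results = list(pipeline)
```

After running this, `events` interleaves production and conversion (element 0 is converted before element 1 is produced), and the filtered-out element 1 is never converted at all. The source and the processing steps only ever handle dictionaries of native Python values; only the adapter knows about the target type.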
Viewed loosely in type-theoretic terms, the format configuration defines a function from the type of Python dictionaries to the type of framework-specific tensor structures, applied lazily to each element.