Principle:Huggingface Datasets Scalar Value Types
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Representing scalar data types (integers, floats, strings, booleans, and temporal types) in dataset schemas provides the foundation for typed columnar storage.
Description
Scalar value types are the atomic building blocks of dataset schemas. Every non-composite column in a dataset is described by a scalar type that maps to an underlying Apache Arrow data type. The supported scalar types are: numeric types (int8 through int64, uint8 through uint64, float16 through float64); text types (string, large_string, string_view); binary types (binary, large_binary, binary_view); boolean; temporal types (date32, date64, time32, time64, timestamp, duration); and decimal types (decimal128, decimal256). Some of these carry a parameter in the dtype string, for example a time unit as in "timestamp[s]". Each dtype string is translated to a PyArrow type at initialization time using a string-to-arrow mapping.
Usage
Use scalar value types to define the type of any simple column in a dataset schema. They are the most common feature type and are used for text fields, numeric scores, identifiers, timestamps, and any other non-composite data.
Theoretical Basis
Scalar types bridge the gap between Python's dynamic typing and Arrow's static type system. By specifying a dtype string, the user selects an Arrow type that determines storage format, memory layout, and valid operations. Before Arrow serialization, the encoding method coerces each Python value to the built-in type that matches the column's dtype (e.g., int() for integer types, float() for floating-point types). The aliases "float" and "double" are normalized to "float32" and "float64" at construction time, so a schema always stores the canonical name.
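The normalization and coercion steps can be illustrated with a library-free sketch. The names `ALIASES`, `normalize_dtype`, and `coerce` are hypothetical and do not belong to the library; the prefix checks are a simplification of dispatching on the resolved Arrow type.

```python
# Hypothetical alias table: shorthand name -> canonical dtype string.
ALIASES = {"float": "float32", "double": "float64"}


def normalize_dtype(dtype: str) -> str:
    """Return the canonical dtype name, resolving known aliases."""
    return ALIASES.get(dtype, dtype)


def coerce(value, dtype: str):
    """Coerce a Python value to the built-in type matching the dtype."""
    dtype = normalize_dtype(dtype)
    if dtype.startswith(("int", "uint")):
        return int(value)
    if dtype.startswith("float"):
        return float(value)
    # Other dtypes are passed through unchanged in this sketch.
    return value


print(normalize_dtype("double"))
print(coerce("7", "int32"))
print(coerce(3, "float"))
```

Normalizing first means the coercion logic only has to recognize canonical names.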