Principle:Huggingface Datasets Scalar Value Types

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Scalar value types (integers, floats, strings, booleans, and temporal types) describe the simple columns of a dataset schema and provide the foundation for typed columnar storage.

Description

Scalar value types are the atomic building blocks of dataset schemas. Every non-composite column in a dataset is described by a scalar type that maps to an underlying Apache Arrow data type. The supported scalar types span numeric types (int8 through int64, uint8 through uint64, float16 through float64), text types (string, large_string, string_view), binary types (binary, large_binary, binary_view), boolean, temporal types (date32, date64, time32, time64, timestamp, duration), and decimal types (decimal128, decimal256). Each scalar type string is translated to a PyArrow type at initialization time using a string-to-arrow mapping.

Usage

Use scalar value types to define the type of any simple column in a dataset schema. They are the most common feature type and are used for text fields, numeric scores, identifiers, timestamps, and any other non-composite data.

Theoretical Basis

Scalar types bridge the gap between Python's dynamic typing and Arrow's static type system. By specifying a dtype string, the user selects an Arrow type that determines storage format, memory layout, and valid operations. The encoding method ensures Python values are coerced to the correct Python type before Arrow serialization (e.g., casting to int() for integer types, float() for floating types). Aliases like "float" for "float32" and "double" for "float64" are normalized at construction time for consistency.
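The two behaviors described here, alias normalization at construction and coercion before serialization, can be sketched with the standard library alone. This is a simplified stand-in written for illustration, not the library's code: the class `ScalarSketch`, the `_ALIASES` table, and the prefix-based dispatch are all hypothetical, and the boolean and string branches are assumptions that extend the integer and float cases named above.

```python
from dataclasses import dataclass

# Hypothetical alias table covering the two aliases named in the text.
_ALIASES = {"float": "float32", "double": "float64"}


@dataclass
class ScalarSketch:
    """Illustrative stand-in for a scalar feature type (not the library class)."""

    dtype: str

    def __post_init__(self):
        # Normalize aliases once, at construction time.
        self.dtype = _ALIASES.get(self.dtype, self.dtype)

    def encode(self, value):
        # Coerce the value to the matching Python type before serialization.
        if self.dtype == "bool":
            return bool(value)
        if self.dtype.startswith(("int", "uint")):
            return int(value)
        if self.dtype.startswith("float"):
            return float(value)
        if self.dtype in ("string", "large_string"):
            return str(value)
        # Any other type passes through unchanged in this sketch.
        return value


normalized = ScalarSketch("double").dtype
encoded_int = ScalarSketch("int32").encode("7")
encoded_float = ScalarSketch("float").encode(3)
print(normalized, encoded_int, encoded_float)
```

Normalizing in the constructor means every later comparison or serialization of the dtype sees a single canonical spelling, so `"double"` and `"float64"` cannot drift apart as two distinct schema entries.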

Related Pages

Implemented By

Implementation:Huggingface_Datasets_Value
