Principle:Pola rs Polars Categorical Type System
| Knowledge Sources | |
|---|---|
| Domains | Type_System, Categorical_Data |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Mechanism that provides efficient encoding and comparison of categorical string data using integer indices backed by concurrent bidirectional mappings.
Description
Categorical encoding is a technique for representing string-valued columns as integer indices into a dictionary of unique values. This dramatically reduces memory usage (storing 4-byte integers instead of variable-length strings) and speeds up operations like comparison, grouping, and joining (integer operations vs. string operations). Polars implements two categorical variants: Categorical (mutable, open-ended dictionary that grows as new values are encountered) and Enum (fixed, immutable dictionary defined at creation time). The type system uses adaptive physical backing (u8, u16, or u32) to minimize memory by selecting the smallest integer type that can represent all category indices. A global registry ensures that categories with the same name and namespace share a single mapping, enabling constant-time type compatibility checks via pointer comparison rather than value comparison.
Usage
Use categorical encoding when a string column has low cardinality (many repeated values relative to unique values) and will be used in grouping, joining, or filtering operations. Choose Categorical when the set of values is unknown at data loading time. Choose Enum when the valid values are known upfront and you want strict type safety (attempting to insert an unknown value is an error). The adaptive physical type selection means no manual tuning is needed for the backing integer width.
Theoretical Basis
Dictionary Encoding: Failed to parse (syntax error): {\displaystyle \text{encoded}[i] = \text{mapping.str\_to\_cat}(\text{original}[i]) } Failed to parse (syntax error): {\displaystyle \text{decoded}[i] = \text{mapping.cat\_to\_str}(\text{encoded}[i]) }
Adaptive Physical Selection:
# Abstract algorithm
select_physical(num_categories):
if num_categories <= 255:
return U8
elif num_categories <= 65535:
return U16
elif num_categories <= 4294967295:
return U32
else:
error("too many categories")
Global Registry for Type Compatibility:
# Abstract algorithm
categories_equal(left, right):
return pointer_eq(left, right) # O(1) via Arc::ptr_eq