Implementation: Pola_rs Polars CategoricalPhysical
| Knowledge Sources | |
|---|---|
| Domains | Type_System, Categorical_Data |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Concrete tool for defining the Polars categorical type system, provided by the polars-dtype crate. It covers physical backing types, mutable category registries, and globally unique frozen category collections.
Description
This module defines the core categorical type system for Polars. CategoricalPhysical is an enum specifying the unsigned integer type (U8, U16, or U32) used to store category indices; its smallest_physical constructor returns the smallest variant that fits a given number of categories. Categories is a named, registry-backed mutable category set that maintains a 1:1 mapping between each category identifier and its backing CategoricalMapping; a global CATEGORIES_REGISTRY (behind a Mutex) ensures uniqueness by name and namespace. FrozenCategories is an immutable, globally unique, ordered collection of category strings backed by a Utf8ViewArray and a CategoricalMapping; content-based hashing and a FROZEN_CATEGORIES_REGISTRY guarantee one shared instance per distinct collection, so equality checks reduce to a constant-time pointer comparison. The module also provides the validation functions ensure_same_categories and ensure_same_frozen_categories.
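The name-and-namespace uniqueness described above can be illustrated with a small, std-only sketch. It is not the Polars implementation: `NamedSet`, `registry`, and `get_or_create` are hypothetical names, and the choice to hold weak references in the registry is an assumption loosely suggested by the `Weak` field in the Categories signature.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock, Weak};

/// Hypothetical stand-in for a named category set (not the Polars type).
#[derive(Debug)]
pub struct NamedSet {
    pub name: String,
    pub namespace: String,
}

type Registry = Mutex<HashMap<(String, String), Weak<NamedSet>>>;

/// Hypothetical global registry keyed by (name, namespace).
/// Assumption: entries are weak, so unused sets can be dropped.
fn registry() -> &'static Registry {
    static REGISTRY: OnceLock<Registry> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return the live set registered under this key, or create and register one.
pub fn get_or_create(name: &str, namespace: &str) -> Arc<NamedSet> {
    let key = (name.to_string(), namespace.to_string());
    let mut reg = registry().lock().unwrap();
    // Reuse the existing instance if some caller still holds it.
    if let Some(existing) = reg.get(&key).and_then(|weak| weak.upgrade()) {
        return existing;
    }
    let created = Arc::new(NamedSet {
        name: key.0.clone(),
        namespace: key.1.clone(),
    });
    reg.insert(key, Arc::downgrade(&created));
    created
}
```

In this sketch, two calls with the same name and namespace return the same `Arc` while it is alive, and a different namespace yields a distinct instance.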
Usage
Import these types when implementing or extending Polars' categorical and enum data type support. CategoricalPhysical is used when defining column schemas with categorical types. Categories is used during data ingestion when categories are being discovered incrementally. FrozenCategories is used for enum types where the set of valid values is fixed at creation time. The validation functions are used internally to ensure type compatibility during joins, concatenation, and other operations that combine categorical columns.
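To make the pointer-comparison idea behind the frozen-category validation concrete, here is a minimal std-only sketch of content-hashed interning. Everything in it is an assumption for illustration: `FrozenSet`, `intern`, and `ensure_same` are hypothetical names, the hasher is the standard library's `DefaultHasher`, and the real registry layout in Polars may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex, OnceLock, Weak};

/// Hypothetical stand-in for an immutable, ordered category collection.
#[derive(Debug)]
pub struct FrozenSet {
    pub content_hash: u64,
    pub values: Vec<String>,
}

type Registry = Mutex<HashMap<u64, Vec<Weak<FrozenSet>>>>;

fn frozen_registry() -> &'static Registry {
    static REGISTRY: OnceLock<Registry> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Intern an ordered list of unique strings; equal lists share one Arc.
pub fn intern(values: &[&str]) -> Result<Arc<FrozenSet>, String> {
    // Reject duplicates up front.
    let mut seen = HashSet::new();
    for v in values {
        if !seen.insert(*v) {
            return Err(format!("duplicate category: {v}"));
        }
    }
    // Hash the ordered contents to pick a registry bucket.
    let mut hasher = DefaultHasher::new();
    values.hash(&mut hasher);
    let content_hash = hasher.finish();

    let mut registry = frozen_registry().lock().unwrap();
    let bucket = registry.entry(content_hash).or_default();
    bucket.retain(|weak| weak.strong_count() > 0); // prune dead entries
    for weak in bucket.iter() {
        if let Some(existing) = weak.upgrade() {
            // Compare contents to guard against hash collisions.
            if existing.values.iter().map(|s| s.as_str()).eq(values.iter().copied()) {
                return Ok(existing);
            }
        }
    }
    let created = Arc::new(FrozenSet {
        content_hash,
        values: values.iter().map(|s| s.to_string()).collect(),
    });
    bucket.push(Arc::downgrade(&created));
    Ok(created)
}

/// Constant-time compatibility check: interned sets are equal iff they share a pointer.
pub fn ensure_same(left: &Arc<FrozenSet>, right: &Arc<FrozenSet>) -> Result<(), String> {
    if Arc::ptr_eq(left, right) {
        Ok(())
    } else {
        Err("frozen category sets differ".to_string())
    }
}
```

Because order is part of the hashed content in this sketch, `["red", "green"]` and `["green", "red"]` intern to different instances and fail the check.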
Code Reference
Source Location
- Repository: Pola_rs_Polars
- File: crates/polars-dtype/src/categorical/mod.rs
- Lines: 1-369
Signature
pub enum CategoricalPhysical {
    U8,
    U16,
    U32,
}

impl CategoricalPhysical {
    pub fn max_categories(&self) -> usize;
    pub fn smallest_physical(num_cats: usize) -> PolarsResult<Self>;
    pub fn as_str(&self) -> &'static str;
}

pub struct Categories {
    id: CategoricalId,
    mapping: Mutex<Weak<CategoricalMapping>>,
}

impl Categories {
    pub fn new(name: PlSmallStr, namespace: PlSmallStr, physical: CategoricalPhysical) -> Arc<Self>;
    pub fn global() -> Arc<Self>;
    pub fn is_global(self: &Arc<Self>) -> bool;
    pub fn random(namespace: PlSmallStr, physical: CategoricalPhysical) -> Arc<Self>;
    pub fn name(&self) -> &PlSmallStr;
    pub fn namespace(&self) -> &PlSmallStr;
    pub fn physical(&self) -> CategoricalPhysical;
    pub fn hash(&self) -> u64;
    pub fn mapping(&self) -> Arc<CategoricalMapping>;
    pub fn freeze(&self) -> Arc<FrozenCategories>;
}

pub struct FrozenCategories {
    physical: CategoricalPhysical,
    combined_hash: u64,
    categories: Utf8ViewArray,
    mapping: Arc<CategoricalMapping>,
}

impl FrozenCategories {
    pub fn new<'a, I: IntoIterator<Item = &'a str>>(strings: I) -> PolarsResult<Arc<Self>>;
    pub fn categories(&self) -> &Utf8ViewArray;
    pub fn physical(&self) -> CategoricalPhysical;
    pub fn mapping(&self) -> &Arc<CategoricalMapping>;
    pub fn hash(&self) -> u64;
}

pub fn ensure_same_categories(left: &Arc<Categories>, right: &Arc<Categories>) -> PolarsResult<()>;
pub fn ensure_same_frozen_categories(left: &Arc<FrozenCategories>, right: &Arc<FrozenCategories>) -> PolarsResult<()>;
Import
use polars_dtype::categorical::{CategoricalPhysical, Categories, FrozenCategories};
use polars_dtype::categorical::{ensure_same_categories, ensure_same_frozen_categories};
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| num_cats (smallest_physical) | usize | Yes | Number of categories to find the smallest backing integer type for |
| name (Categories::new) | PlSmallStr | Yes | Name identifier for the category set |
| namespace (Categories::new) | PlSmallStr | Yes | Namespace to scope the category set |
| physical (Categories::new) | CategoricalPhysical | Yes | Physical unsigned integer type for category indices |
| strings (FrozenCategories::new) | IntoIterator<Item = &str> | Yes | Ordered, unique strings to freeze as categories |
Outputs
| Name | Type | Description |
|---|---|---|
| Categories::new returns | Arc<Categories> | Registry-backed category set (existing or newly created) |
| Categories::mapping returns | Arc<CategoricalMapping> | The bidirectional string-to-index mapping for this category set |
| Categories::freeze returns | Arc<FrozenCategories> | Immutable snapshot of the current categories |
| FrozenCategories::new returns | PolarsResult<Arc<FrozenCategories>> | Globally-unique frozen category collection (error if duplicates found) |
| smallest_physical returns | PolarsResult<CategoricalPhysical> | Smallest U8/U16/U32 type fitting the given count |
Usage Examples
Selecting Physical Type
use polars_dtype::categorical::CategoricalPhysical;
// Small number of categories uses U8
let phys = CategoricalPhysical::smallest_physical(100).unwrap();
assert_eq!(phys, CategoricalPhysical::U8);
// Larger sets use U16 or U32
let phys = CategoricalPhysical::smallest_physical(1000).unwrap();
assert_eq!(phys, CategoricalPhysical::U16);
let max = CategoricalPhysical::U32.max_categories();
assert_eq!(max, u32::MAX as usize);
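The selection logic exercised above can be approximated without the crate. The following is a sketch under stated assumptions, not the Polars source: `Physical` and `smallest_for` are hypothetical names, and the capacity of each variant is assumed to be the maximum value of its backing integer. How the real implementation treats counts exactly at a boundary is not confirmed here.

```rust
/// Hypothetical mirror of the three physical variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Physical {
    U8,
    U16,
    U32,
}

impl Physical {
    /// Assumed capacity: the maximum value of the backing integer type.
    pub fn max_categories(self) -> usize {
        match self {
            Physical::U8 => u8::MAX as usize,
            Physical::U16 => u16::MAX as usize,
            Physical::U32 => u32::MAX as usize,
        }
    }

    /// Pick the first (smallest) variant whose capacity covers `num_cats`.
    /// Returns None when no variant is large enough.
    pub fn smallest_for(num_cats: usize) -> Option<Self> {
        [Physical::U8, Physical::U16, Physical::U32]
            .iter()
            .copied()
            .find(|p| num_cats <= p.max_categories())
    }
}
```

Trying the variants in ascending order keeps the rule in one place: adding a wider variant would only require extending the array and the match.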
Working with Categories
use polars_dtype::categorical::{Categories, CategoricalPhysical};
use polars_utils::pl_str::PlSmallStr;
// Create or retrieve a named category set
let cats = Categories::new(
    PlSmallStr::from("colors"),
    PlSmallStr::from("default"),
    CategoricalPhysical::U8,
);
// Insert categories via the mapping
let mapping = cats.mapping();
mapping.insert_cat("red").unwrap();
mapping.insert_cat("green").unwrap();
mapping.insert_cat("blue").unwrap();
// Freeze for use as an Enum type
let frozen = cats.freeze();
assert_eq!(frozen.categories().len(), 3);
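For readers unfamiliar with what a bidirectional string-to-index mapping involves, the following std-only sketch shows one simple way to build it. It is an illustration only: `SimpleMapping` and its methods are hypothetical, it takes `&mut self` for simplicity rather than supporting shared insertion as the example above implies for CategoricalMapping, and returning the existing index on a repeated insert is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical bidirectional mapping between strings and dense indices.
#[derive(Debug)]
pub struct SimpleMapping {
    to_index: HashMap<String, u32>,
    to_string: Vec<String>,
    max_categories: usize,
}

impl SimpleMapping {
    /// Create an empty mapping that accepts at most `max_categories` entries.
    pub fn new(max_categories: usize) -> Self {
        SimpleMapping {
            to_index: HashMap::new(),
            to_string: Vec::new(),
            max_categories,
        }
    }

    /// Insert a category and return its index; repeated inserts return the same index.
    pub fn insert(&mut self, s: &str) -> Result<u32, String> {
        if let Some(&idx) = self.to_index.get(s) {
            return Ok(idx);
        }
        if self.to_string.len() >= self.max_categories {
            return Err(format!("cannot insert '{s}': category limit reached"));
        }
        let idx = self.to_string.len() as u32;
        self.to_string.push(s.to_string());
        self.to_index.insert(s.to_string(), idx);
        Ok(idx)
    }

    /// Look up the string stored at an index.
    pub fn get(&self, idx: u32) -> Option<&str> {
        self.to_string.get(idx as usize).map(|s| s.as_str())
    }

    /// Number of distinct categories inserted so far.
    pub fn len(&self) -> usize {
        self.to_string.len()
    }
}
```

Indices are assigned in insertion order, so a snapshot of `to_string` at any point is an ordered list of categories, which is the shape a frozen collection needs.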