Implementation:Pola-rs Polars CategoricalPhysical

From Leeroopedia


Knowledge Sources
Domains: Type_System, Categorical_Data
Last Updated: 2026-02-09 09:00 GMT

Overview

Concrete tool, provided by the polars-dtype crate, for defining the categorical type system: physical backing types, mutable category registries, and globally unique frozen category collections.

Description

This module defines the core categorical type system for Polars.

CategoricalPhysical is an enum naming the unsigned integer type (U8, U16, or U32) used to store category indices. Its smallest_physical constructor returns the smallest variant that fits a given number of categories.

Categories is a named, registry-backed, mutable category set that maintains a 1:1 mapping between a category identifier and its backing CategoricalMapping. A global CATEGORIES_REGISTRY, guarded by a Mutex, ensures uniqueness by name and namespace.

FrozenCategories is an immutable, globally unique, ordered collection of category strings backed by a Utf8ViewArray and a CategoricalMapping. It is deduplicated through content-based hashing and a FROZEN_CATEGORIES_REGISTRY, so equality checks reduce to a constant-time pointer comparison.

The module also provides the validation functions ensure_same_categories and ensure_same_frozen_categories.
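The registry idea behind Categories can be illustrated with a small standard-library model. This is a sketch under stated assumptions, not the Polars source: NamedSet, registry, and get_or_create are hypothetical names, and the use of weak references in the map is an assumption suggested by the Weak field in the Categories struct.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock, Weak};

// Hypothetical stand-in for a named category set; not the Polars type.
#[derive(Debug)]
pub struct NamedSet {
    pub name: String,
    pub namespace: String,
}

type Key = (String, String);

// Process-wide registry keyed by (name, namespace). Storing Weak handles
// (an assumption of this sketch) lets a set be freed once nothing uses it.
fn registry() -> &'static Mutex<HashMap<Key, Weak<NamedSet>>> {
    static REGISTRY: OnceLock<Mutex<HashMap<Key, Weak<NamedSet>>>> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

// Return the live set registered under this key, or create and register one.
pub fn get_or_create(name: &str, namespace: &str) -> Arc<NamedSet> {
    let key = (name.to_string(), namespace.to_string());
    let mut map = registry().lock().unwrap();
    if let Some(existing) = map.get(&key).and_then(|weak| weak.upgrade()) {
        return existing;
    }
    let fresh = Arc::new(NamedSet {
        name: key.0.clone(),
        namespace: key.1.clone(),
    });
    map.insert(key, Arc::downgrade(&fresh));
    fresh
}
```

In this model, two calls with the same name and namespace return handles to one allocation, while a different namespace yields a distinct set.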

Usage

Import these types when implementing or extending Polars' categorical and enum data type support. CategoricalPhysical is used when defining column schemas with categorical types. Categories is used during data ingestion when categories are being discovered incrementally. FrozenCategories is used for enum types where the set of valid values is fixed at creation time. The validation functions are used internally to ensure type compatibility during joins, concatenation, and other operations that combine categorical columns.

Code Reference

Source Location

Signature

pub enum CategoricalPhysical {
    U8,
    U16,
    U32,
}

impl CategoricalPhysical {
    pub fn max_categories(&self) -> usize;
    pub fn smallest_physical(num_cats: usize) -> PolarsResult<Self>;
    pub fn as_str(&self) -> &'static str;
}

pub struct Categories {
    id: CategoricalId,
    mapping: Mutex<Weak<CategoricalMapping>>,
}

impl Categories {
    pub fn new(name: PlSmallStr, namespace: PlSmallStr, physical: CategoricalPhysical) -> Arc<Self>;
    pub fn global() -> Arc<Self>;
    pub fn is_global(self: &Arc<Self>) -> bool;
    pub fn random(namespace: PlSmallStr, physical: CategoricalPhysical) -> Arc<Self>;
    pub fn name(&self) -> &PlSmallStr;
    pub fn namespace(&self) -> &PlSmallStr;
    pub fn physical(&self) -> CategoricalPhysical;
    pub fn hash(&self) -> u64;
    pub fn mapping(&self) -> Arc<CategoricalMapping>;
    pub fn freeze(&self) -> Arc<FrozenCategories>;
}

pub struct FrozenCategories {
    physical: CategoricalPhysical,
    combined_hash: u64,
    categories: Utf8ViewArray,
    mapping: Arc<CategoricalMapping>,
}

impl FrozenCategories {
    pub fn new<'a, I: IntoIterator<Item = &'a str>>(strings: I) -> PolarsResult<Arc<Self>>;
    pub fn categories(&self) -> &Utf8ViewArray;
    pub fn physical(&self) -> CategoricalPhysical;
    pub fn mapping(&self) -> &Arc<CategoricalMapping>;
    pub fn hash(&self) -> u64;
}

pub fn ensure_same_categories(left: &Arc<Categories>, right: &Arc<Categories>) -> PolarsResult<()>;
pub fn ensure_same_frozen_categories(left: &Arc<FrozenCategories>, right: &Arc<FrozenCategories>) -> PolarsResult<()>;
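Because each category collection is globally unique, a validation function of this shape can plausibly be reduced to a pointer comparison. The following is a minimal sketch of that idea using only the standard library. CatSet and ensure_same are hypothetical names, the error type is a plain String rather than PolarsResult, and whether the Polars functions are implemented exactly this way is an assumption.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a category collection; not the Polars type.
#[derive(Debug)]
pub struct CatSet {
    pub name: String,
}

// Succeed only when both handles point at the same allocation.
// No string contents are compared, so the check is constant-time.
pub fn ensure_same(left: &Arc<CatSet>, right: &Arc<CatSet>) -> Result<(), String> {
    if Arc::ptr_eq(left, right) {
        Ok(())
    } else {
        Err(format!(
            "category sets differ: '{}' vs '{}'",
            left.name, right.name
        ))
    }
}
```

Note that in this sketch two separately allocated sets with the same name are still rejected, which is why a registry that hands out a single shared instance matters.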

Import

use polars_dtype::categorical::{CategoricalPhysical, Categories, FrozenCategories};
use polars_dtype::categorical::{ensure_same_categories, ensure_same_frozen_categories};

I/O Contract

Inputs

Name | Type | Required | Description
num_cats (smallest_physical) | usize | Yes | Number of categories for which to find the smallest backing integer type
name (Categories::new) | PlSmallStr | Yes | Name identifier for the category set
namespace (Categories::new) | PlSmallStr | Yes | Namespace that scopes the category set
physical (Categories::new) | CategoricalPhysical | Yes | Physical unsigned integer type for category indices
strings (FrozenCategories::new) | IntoIterator<Item = &str> | Yes | Ordered, unique strings to freeze as categories

Outputs

Name | Type | Description
Categories::new returns | Arc<Categories> | Registry-backed category set (existing or newly created)
Categories::mapping returns | Arc<CategoricalMapping> | The bidirectional string-to-index mapping for this category set
Categories::freeze returns | Arc<FrozenCategories> | Immutable snapshot of the current categories
FrozenCategories::new returns | PolarsResult<Arc<FrozenCategories>> | Globally unique frozen category collection (error if duplicates are found)
smallest_physical returns | PolarsResult<CategoricalPhysical> | Smallest of U8/U16/U32 that fits the given count
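The contract of FrozenCategories::new (reject duplicates, return one shared instance per distinct content) can be modeled with content-based interning. The sketch below is an illustration under assumptions, not the Polars implementation: FrozenSet, frozen_registry, and intern_frozen are hypothetical names, a Vec<String> stands in for the Utf8ViewArray, and the standard DefaultHasher stands in for whatever hash Polars uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex, OnceLock, Weak};

// Hypothetical stand-in for a frozen category collection; not the Polars type.
#[derive(Debug)]
pub struct FrozenSet {
    pub categories: Vec<String>,
    pub combined_hash: u64,
}

type Buckets = HashMap<u64, Vec<Weak<FrozenSet>>>;

fn frozen_registry() -> &'static Mutex<Buckets> {
    static REGISTRY: OnceLock<Mutex<Buckets>> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

// Intern an ordered list of unique strings; equal contents yield the same Arc.
pub fn intern_frozen(strings: &[&str]) -> Result<Arc<FrozenSet>, String> {
    // Reject duplicates, mirroring the documented error case.
    let mut seen = HashSet::new();
    for s in strings {
        if !seen.insert(*s) {
            return Err(format!("duplicate category: {s}"));
        }
    }

    // Hash the ordered contents; order matters because position is the index.
    let mut hasher = DefaultHasher::new();
    strings.hash(&mut hasher);
    let combined_hash = hasher.finish();

    let mut registry = frozen_registry().lock().unwrap();
    let bucket = registry.entry(combined_hash).or_default();
    // Drop entries whose collections have already been freed.
    bucket.retain(|weak| weak.strong_count() > 0);
    // Compare contents as well, to guard against hash collisions.
    for weak in bucket.iter() {
        if let Some(existing) = weak.upgrade() {
            let same = existing
                .categories
                .iter()
                .map(String::as_str)
                .eq(strings.iter().copied());
            if same {
                return Ok(existing);
            }
        }
    }
    let fresh = Arc::new(FrozenSet {
        categories: strings.iter().map(|s| s.to_string()).collect(),
        combined_hash,
    });
    bucket.push(Arc::downgrade(&fresh));
    Ok(fresh)
}
```

With this design, comparing two frozen collections only requires Arc::ptr_eq on their handles; the cost of hashing and content comparison is paid once, at creation time.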

Usage Examples

Selecting Physical Type

use polars_dtype::categorical::CategoricalPhysical;

// Small number of categories uses U8
let phys = CategoricalPhysical::smallest_physical(100).unwrap();
assert_eq!(phys, CategoricalPhysical::U8);

// Larger sets use U16 or U32
let phys = CategoricalPhysical::smallest_physical(1000).unwrap();
assert_eq!(phys, CategoricalPhysical::U16);

let max = CategoricalPhysical::U32.max_categories();
assert_eq!(max, u32::MAX as usize);
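The selection logic exercised above can be sketched without the crate. This is a simplified model, not the Polars source: Physical and smallest are hypothetical names, the error type is a String, and the exact boundary (whether a count equal to the integer maximum still fits) is an assumption that may differ from the real smallest_physical.

```rust
// Hypothetical mirror of the documented enum, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Physical {
    U8,
    U16,
    U32,
}

impl Physical {
    // Assumed capacity: the maximum value of the backing integer type.
    pub fn max_categories(self) -> usize {
        match self {
            Physical::U8 => u8::MAX as usize,
            Physical::U16 => u16::MAX as usize,
            Physical::U32 => u32::MAX as usize,
        }
    }

    // Pick the first (smallest) type whose assumed capacity covers `num_cats`.
    pub fn smallest(num_cats: usize) -> Result<Self, String> {
        [Physical::U8, Physical::U16, Physical::U32]
            .iter()
            .copied()
            .find(|p| num_cats <= p.max_categories())
            .ok_or_else(|| format!("{num_cats} categories exceed the U32 capacity"))
    }
}
```

Trying the variants in ascending order keeps the narrowest index type that fits, which reduces memory use for columns with few distinct categories.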

Working with Categories

use polars_dtype::categorical::{Categories, CategoricalPhysical};
use polars_utils::pl_str::PlSmallStr;

// Create or retrieve a named category set
let cats = Categories::new(
    PlSmallStr::from("colors"),
    PlSmallStr::from("default"),
    CategoricalPhysical::U8,
);

// Insert categories via the mapping
let mapping = cats.mapping();
mapping.insert_cat("red").unwrap();
mapping.insert_cat("green").unwrap();
mapping.insert_cat("blue").unwrap();

// Freeze for use as an Enum type
let frozen = cats.freeze();
assert_eq!(frozen.categories().len(), 3);
