Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Pola rs Polars Categorical Type System

From Leeroopedia


Knowledge Sources
Domains Type_System, Categorical_Data
Last Updated 2026-02-09 09:00 GMT

Overview

Mechanism that provides efficient encoding and comparison of categorical string data using integer indices backed by concurrent bidirectional mappings.

Description

Categorical encoding is a technique for representing string-valued columns as integer indices into a dictionary of unique values. This dramatically reduces memory usage (storing 4-byte integers instead of variable-length strings) and speeds up operations like comparison, grouping, and joining (integer operations vs. string operations). Polars implements two categorical variants: Categorical (mutable, open-ended dictionary that grows as new values are encountered) and Enum (fixed, immutable dictionary defined at creation time). The type system uses adaptive physical backing (u8, u16, or u32) to minimize memory by selecting the smallest integer type that can represent all category indices. A global registry ensures that categories with the same name and namespace share a single mapping, enabling constant-time type compatibility checks via pointer comparison rather than value comparison.

Usage

Use categorical encoding when a string column has low cardinality (many repeated values relative to unique values) and will be used in grouping, joining, or filtering operations. Choose Categorical when the set of values is unknown at data loading time. Choose Enum when the valid values are known upfront and you want strict type safety (attempting to insert an unknown value is an error). The adaptive physical type selection means no manual tuning is needed for the backing integer width.

Theoretical Basis

Dictionary Encoding: Failed to parse (syntax error): {\displaystyle \text{encoded}[i] = \text{mapping.str\_to\_cat}(\text{original}[i]) } Failed to parse (syntax error): {\displaystyle \text{decoded}[i] = \text{mapping.cat\_to\_str}(\text{encoded}[i]) }

Adaptive Physical Selection:

# Abstract algorithm
select_physical(num_categories):
    if num_categories <= 255:
        return U8
    elif num_categories <= 65535:
        return U16
    elif num_categories <= 4294967295:
        return U32
    else:
        error("too many categories")

Global Registry for Type Compatibility:

# Abstract algorithm
categories_equal(left, right):
    return pointer_eq(left, right)  # O(1) via Arc::ptr_eq

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment