Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Pola rs Polars CategoricalMapping

From Leeroopedia


Knowledge Sources
Domains Categorical_Data, Concurrent_Data_Structures
Last Updated 2026-02-09 09:00 GMT

Overview

Concrete tool for concurrent bidirectional mapping between strings and categorical integer indices provided by the polars-dtype crate.

Description

CategoricalMapping is a thread-safe bidirectional mapping between string values and categorical integer indices (CatSize = u32). It uses two data structures: a RawTable<str, CatSize> hash table for fast string-to-category lookups, and a boxcar::Vec<(&'static str, u64)> lock-free append-only vector for category-to-string reverse lookups (also storing pre-computed hashes). The mapping supports concurrent inserts via atomic operations on an AtomicUsize upper bound counter that enforces the maximum category limit. The insert_cat and insert_cat_with_hash methods use try_get_or_insert_with for lock-free upsert semantics. The to_arrow method serializes the mapping to either a Utf8ViewArray or Utf8Array<i64> for Arrow interoperability.

Usage

Import CategoricalMapping when implementing custom categorical encoding logic or when extending the categorical type system. It is the core data structure backing both Categories (for mutable categorical columns) and FrozenCategories (for enum columns). The concurrent design makes it safe for parallel data ingestion where multiple threads may discover new categories simultaneously.

Code Reference

Source Location

Signature

pub struct CategoricalMapping {
    str_to_cat: RawTable<str, CatSize>,
    cat_to_str_and_hash: boxcar::Vec<(&'static str, u64)>,
    max_categories: usize,
    upper_bound: AtomicUsize,
    hasher: PlSeedableRandomStateQuality,
}

impl CategoricalMapping {
    pub fn with_hasher(max_categories: usize, hasher: PlSeedableRandomStateQuality) -> Self;
    pub fn hasher(&self) -> &PlSeedableRandomStateQuality;
    pub fn max_categories(&self) -> usize;
    pub fn set_max_categories(&mut self, max_categories: usize);
    pub fn get_cat(&self, s: &str) -> Option<CatSize>;
    pub fn get_cat_with_hash(&self, s: &str, hash: u64) -> Option<CatSize>;
    pub fn insert_cat(&self, s: &str) -> PolarsResult<CatSize>;
    pub fn insert_cat_with_hash(&self, s: &str, hash: u64) -> PolarsResult<CatSize>;
    pub fn cat_to_str(&self, cat: CatSize) -> Option<&str>;
    pub unsafe fn cat_to_str_unchecked(&self, cat: CatSize) -> &str;
    pub fn cat_to_hash(&self, cat: CatSize) -> Option<u64>;
    pub unsafe fn cat_to_hash_unchecked(&self, cat: CatSize) -> u64;
    pub fn num_cats_upper_bound(&self) -> usize;
    pub fn len(&mut self) -> usize;
    pub fn is_empty(&mut self) -> bool;
    pub fn to_arrow(&self, as_views: bool) -> Box<dyn Array>;
}

Import

use polars_dtype::categorical::CategoricalMapping;

I/O Contract

Inputs

Name Type Required Description
s (insert_cat) &str Yes String value to map to a categorical index
cat (cat_to_str) CatSize Yes Categorical index to resolve to its string
max_categories usize Yes Maximum allowed categories (enforced on insert)
hash (with_hash variants) u64 Yes Pre-computed hash of the string for performance

Outputs

Name Type Description
insert_cat returns PolarsResult<CatSize> Index of the string (existing or newly inserted), error if max exceeded
get_cat returns Option<CatSize> Index if string exists, None otherwise
cat_to_str returns Option<&str> String for the given index, None if out of range
to_arrow returns Box<dyn Array> Arrow array containing all category strings in order
num_cats_upper_bound returns usize Upper bound on category count (may overcount during concurrent inserts)

Usage Examples

Basic String-to-Category Mapping

use polars_dtype::categorical::CategoricalMapping;

let mapping = CategoricalMapping::new(u32::MAX as usize);

// Insert categories
let red = mapping.insert_cat("red").unwrap();
let green = mapping.insert_cat("green").unwrap();
let blue = mapping.insert_cat("blue").unwrap();

// Categories are assigned sequential indices
assert_eq!(red, 0);
assert_eq!(green, 1);
assert_eq!(blue, 2);

// Lookup by string
assert_eq!(mapping.get_cat("green"), Some(1));
assert_eq!(mapping.get_cat("yellow"), None);

// Reverse lookup
assert_eq!(mapping.cat_to_str(0), Some("red"));
assert_eq!(mapping.cat_to_str(1), Some("green"));

Duplicate Insert Returns Existing Index

use polars_dtype::categorical::CategoricalMapping;

let mapping = CategoricalMapping::new(u32::MAX as usize);
let first = mapping.insert_cat("hello").unwrap();
let second = mapping.insert_cat("hello").unwrap();
assert_eq!(first, second);  // Same index returned

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment