Implementation:Pola rs Polars CategoricalMapping
| Knowledge Sources | |
|---|---|
| Domains | Categorical_Data, Concurrent_Data_Structures |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Concrete tool for concurrent bidirectional mapping between strings and categorical integer indices provided by the polars-dtype crate.
Description
CategoricalMapping is a thread-safe bidirectional mapping between string values and categorical integer indices (CatSize = u32). It uses two data structures: a RawTable<str, CatSize> hash table for fast string-to-category lookups, and a boxcar::Vec<(&'static str, u64)> lock-free append-only vector for category-to-string reverse lookups (also storing pre-computed hashes). The mapping supports concurrent inserts via atomic operations on an AtomicUsize upper bound counter that enforces the maximum category limit. The insert_cat and insert_cat_with_hash methods use try_get_or_insert_with for lock-free upsert semantics. The to_arrow method serializes the mapping to either a Utf8ViewArray or Utf8Array<i64> for Arrow interoperability.
Usage
Import CategoricalMapping when implementing custom categorical encoding logic or when extending the categorical type system. It is the core data structure backing both Categories (for mutable categorical columns) and FrozenCategories (for enum columns). The concurrent design makes it safe for parallel data ingestion where multiple threads may discover new categories simultaneously.
Code Reference
Source Location
- Repository: Pola_rs_Polars
- File: crates/polars-dtype/src/categorical/mapping.rs
- Lines: 1-184
Signature
pub struct CategoricalMapping {
str_to_cat: RawTable<str, CatSize>,
cat_to_str_and_hash: boxcar::Vec<(&'static str, u64)>,
max_categories: usize,
upper_bound: AtomicUsize,
hasher: PlSeedableRandomStateQuality,
}
impl CategoricalMapping {
pub fn with_hasher(max_categories: usize, hasher: PlSeedableRandomStateQuality) -> Self;
pub fn hasher(&self) -> &PlSeedableRandomStateQuality;
pub fn max_categories(&self) -> usize;
pub fn set_max_categories(&mut self, max_categories: usize);
pub fn get_cat(&self, s: &str) -> Option<CatSize>;
pub fn get_cat_with_hash(&self, s: &str, hash: u64) -> Option<CatSize>;
pub fn insert_cat(&self, s: &str) -> PolarsResult<CatSize>;
pub fn insert_cat_with_hash(&self, s: &str, hash: u64) -> PolarsResult<CatSize>;
pub fn cat_to_str(&self, cat: CatSize) -> Option<&str>;
pub unsafe fn cat_to_str_unchecked(&self, cat: CatSize) -> &str;
pub fn cat_to_hash(&self, cat: CatSize) -> Option<u64>;
pub unsafe fn cat_to_hash_unchecked(&self, cat: CatSize) -> u64;
pub fn num_cats_upper_bound(&self) -> usize;
pub fn len(&mut self) -> usize;
pub fn is_empty(&mut self) -> bool;
pub fn to_arrow(&self, as_views: bool) -> Box<dyn Array>;
}
Import
use polars_dtype::categorical::CategoricalMapping;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s (insert_cat) | &str | Yes | String value to map to a categorical index |
| cat (cat_to_str) | CatSize | Yes | Categorical index to resolve to its string |
| max_categories | usize | Yes | Maximum allowed categories (enforced on insert) |
| hash (with_hash variants) | u64 | Yes | Pre-computed hash of the string for performance |
Outputs
| Name | Type | Description |
|---|---|---|
| insert_cat returns | PolarsResult<CatSize> | Index of the string (existing or newly inserted), error if max exceeded |
| get_cat returns | Option<CatSize> | Index if string exists, None otherwise |
| cat_to_str returns | Option<&str> | String for the given index, None if out of range |
| to_arrow returns | Box<dyn Array> | Arrow array containing all category strings in order |
| num_cats_upper_bound returns | usize | Upper bound on category count (may overcount during concurrent inserts) |
Usage Examples
Basic String-to-Category Mapping
use polars_dtype::categorical::CategoricalMapping;
let mapping = CategoricalMapping::new(u32::MAX as usize);
// Insert categories
let red = mapping.insert_cat("red").unwrap();
let green = mapping.insert_cat("green").unwrap();
let blue = mapping.insert_cat("blue").unwrap();
// Categories are assigned sequential indices
assert_eq!(red, 0);
assert_eq!(green, 1);
assert_eq!(blue, 2);
// Lookup by string
assert_eq!(mapping.get_cat("green"), Some(1));
assert_eq!(mapping.get_cat("yellow"), None);
// Reverse lookup
assert_eq!(mapping.cat_to_str(0), Some("red"));
assert_eq!(mapping.cat_to_str(1), Some("green"));
Duplicate Insert Returns Existing Index
use polars_dtype::categorical::CategoricalMapping;
let mapping = CategoricalMapping::new(u32::MAX as usize);
let first = mapping.insert_cat("hello").unwrap();
let second = mapping.insert_cat("hello").unwrap();
assert_eq!(first, second); // Same index returned