Heuristic:Interpretml Interpret Categorical Float Conversion Gotcha

Knowledge Sources	InterpretML Interpret Internal debugging of cross-language consistency issues
Domains	Data_Processing, Machine_Learning
Last Updated	2026-02-07 12:00 GMT

Overview

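Critical data conversion rule: convert float values to `np.float64` before turning them into strings for categorical handling. Stringifying other float types (such as `np.float32` or `np.float16`) directly can yield a different string for the same underlying value, which breaks cross-language compatibility.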

Description

When InterpretML processes features, it must decide whether to treat numeric values as continuous or categorical. Part of this process involves converting float values to their string representations for categorical lookup. However, a subtle cross-language consistency bug exists: converting a float of any type other than `np.float64` directly to a string can produce a different representation than first widening the value to `np.float64` and then converting it to a string. This means the same underlying value could map to different categorical bins depending on the float precision of the input, breaking reproducibility. The codebase also prioritizes predict-time performance over fit-time performance, so expensive conversions are performed during the fit phase.

Usage

Apply this heuristic when:

Processing input data with mixed float precisions (float32, float16)
Implementing categorical encoding for EBM features
Debugging mismatches between training and prediction categorical assignments
Working on cross-language compatibility (Python/R/C++)

The Insight (Rule of Thumb)

Action: Always convert float values to `np.float64` before converting to strings for categorical lookup.
Value: N/A (consistency rule, not a numeric parameter).
Trade-off: Slight overhead from the float64 conversion step; prevents silent data corruption in categorical matching.
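
The rule above can be sketched as a small normalization step. This is a minimal illustration, not InterpretML's actual code: the helper name `floats_to_category_strings` is hypothetical, and it assumes the input is a NumPy-compatible array of floats.

```python
import numpy as np

def floats_to_category_strings(values):
    """Hypothetical helper: widen floats to float64, then stringify.

    Widening first means float16, float32, and float64 inputs holding
    the same binary value all produce the same category string.
    """
    arr = np.asarray(values)
    if arr.dtype.kind != "f":
        raise TypeError("expected a floating-point array")
    # Widening is exact: every float16/float32 value is representable in float64.
    widened = arr.astype(np.float64, copy=False)
    # Iterating a float64 array yields np.float64 scalars, so str() uses
    # the float64 shortest round-trip representation.
    return [str(v) for v in widened.ravel()]
```

With this helper, a `float32` array holding `0.1` is stringified as `"0.10000000149011612"`, which is the same string a `float64` array holding that widened value would produce.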

Secondary Rule: Fit-Time vs Predict-Time Optimization

Action: Move expensive categorical dictionary construction to fit time, not predict time.
Value: At predict time the categories dictionary already exists, so prediction incurs no additional cost.
Trade-off: Fit time is slightly longer; predict time remains fast.
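
One way to picture the fit-time versus predict-time split is shown below. This is an illustrative sketch under stated assumptions, not the library's implementation: `fit_categories` and `predict_bins` are hypothetical names, and reserving bin 0 for unseen values is an arbitrary choice made for the example.

```python
import numpy as np

def fit_categories(column):
    """Fit time (hypothetical): build the string -> bin dictionary once."""
    as_f64 = np.asarray(column).astype(np.float64, copy=False)
    unique_strings = sorted({str(v) for v in as_f64.ravel()})
    # Bins start at 1; bin 0 is used for unseen values in this sketch only.
    return {s: i + 1 for i, s in enumerate(unique_strings)}

def predict_bins(column, categories):
    """Predict time (hypothetical): only dictionary lookups, no construction."""
    as_f64 = np.asarray(column).astype(np.float64, copy=False)
    return [categories.get(str(v), 0) for v in as_f64.ravel()]

# Train on float32 data, predict on float64 data holding the same values.
categories = fit_categories(np.array([0.5, 0.25, 0.5], dtype=np.float32))
bins = predict_bins(np.array([0.5, 0.75], dtype=np.float64), categories)
```

The dictionary is built once in `fit_categories`; `predict_bins` reuses it, so the per-prediction work is limited to stringifying and looking up each value.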

Reasoning

The problem is fundamental to floating-point representation:

`np.float32(0.1)` has string repr `"0.1"` but binary value `0.100000001490116...`
`np.float64(np.float32(0.1))` has string repr `"0.10000000149011612"`
Converting float32 directly to string gives `"0.1"`, but converting float32 to float64 then to string gives `"0.10000000149011612"`
If the model was trained on strings derived from float64 values, a lookup that uses strings derived directly from float32 values will not find the trained category
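
The two conversion paths described above can be compared directly. This is a small demonstration using only NumPy scalar types; the variable names are illustrative.

```python
import numpy as np

x32 = np.float32(0.1)

# Path 1: stringify the float32 directly (shortest string that
# round-trips as a float32).
direct = str(x32)

# Path 2: widen to float64 first, then stringify (shortest string
# that round-trips as a float64).
via_f64 = str(np.float64(x32))

print(direct)             # 0.1
print(via_f64)            # 0.10000000149011612
print(direct == via_f64)  # False

# Values that are exactly representable in both precisions agree.
same_for_half = str(np.float32(0.5)) == str(np.float64(np.float32(0.5)))
```

The mismatch therefore appears only for values whose float32 form is not also the shortest float64 representation, which is part of why it can go unnoticed in tests that use round values.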

This is particularly dangerous because:

Users commonly pass `np.float32` arrays from GPU frameworks
The mismatch is silent (no error raised, just wrong bin assignment)
It only affects categorical handling, not continuous features
