Implementation:Interpretml Interpret CutWinsorized
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, EBM_Core |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
CutWinsorized is a C++ module that generates winsorized cut points for feature discretization by trimming outliers from the tails of the data distribution before placing evenly spaced cuts.
Description
The CutWinsorized function implements a winsorized binning strategy where extreme values are trimmed and cut points are placed within the resulting range. The algorithm works by:
- Copying and sorting the input feature values, removing missing values and replacing infinities with the maximum/minimum representable float values.
- Determining outer boundary values by moving inward from the sorted extremes by a number of positions determined by the sample and bin counts, (cSamples - 1) / cBins.
- Finding transition points near the center of the data when only one cut is requested.
- For multiple cuts, finding the inner transition values and placing evenly spaced cuts between them.
- Using the ArithmeticMean helper to compute midpoints between transition values.
- Using FloatTickIncrement to ensure the upper boundary slightly exceeds the last observed value for proper lower-bound-inclusive bin assignment.
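The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the function name winsorized_cuts_sketch and its parameters are hypothetical, the bin count is assumed to be the number of cuts plus one, the outermost cuts are assumed to sit on the trimmed boundary values, and the transition-value search and tick-increment handling are left out.

```python
import math
import sys

def winsorized_cuts_sketch(values, n_cuts):
    """Simplified winsorized cut placement (illustrative only)."""
    f_max = sys.float_info.max
    # Drop missing values (NaN) and clamp infinities to the largest finite floats.
    cleaned = sorted(
        max(-f_max, min(f_max, float(v))) for v in values if not math.isnan(v)
    )
    n = len(cleaned)
    if n == 0 or n_cuts <= 0:
        return []
    if cleaned[0] == cleaned[-1]:
        return []  # single-valued data: nothing to separate
    # Assumption: bins = cuts + 1; step inward by (n - 1) // bins positions.
    offset = (n - 1) // (n_cuts + 1)
    low, high = cleaned[offset], cleaned[n - 1 - offset]
    if n_cuts == 1 or low == high:
        # One cut near the center of the trimmed range (the real module instead
        # searches for neighboring transition values).
        return [low / 2 + high / 2]
    # Assumption: outermost cuts sit on the trimmed boundaries, the rest evenly between.
    step = (high - low) / (n_cuts - 1)  # not overflow-safe for extreme ranges
    return [low + step * i for i in range(n_cuts)]
```

With the values 0.0 through 99.0 plus one NaN and one positive infinity, requesting four cuts in this sketch yields 20.0, 40.0, 60.0 and 80.0: the infinite value is clamped and then trimmed away with the rest of the upper tail, so it does not stretch the cut range.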
The algorithm handles edge cases including single-valued data, data with few distinct values, and numerical overflow in step size computation.
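Two of these numerical details can be illustrated in isolation. The sketch below shows the kind of logic involved and is not a transcription of ArithmeticMean or FloatTickIncrement: midpoint_sketch and tick_increment_sketch are hypothetical names, and math.nextafter (Python 3.9 or later) stands in for a one-step float increment.

```python
import math
import sys

def midpoint_sketch(low, high):
    """Midpoint of two finite floats that does not overflow to infinity."""
    mid = (low + high) / 2
    if math.isinf(mid):
        # low + high overflowed; halve each operand before adding instead.
        mid = low / 2 + high / 2
    return mid

def tick_increment_sketch(value):
    """Smallest float strictly greater than value (one step toward +inf)."""
    return math.nextafter(value, math.inf)

f_max = sys.float_info.max
```

The increment matters because cuts are lower-bound-inclusive: a boundary exactly equal to the largest observed value would push that value into the next bin, whereas a boundary one step above it keeps the value below the boundary.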
Usage
This module is called during the feature discretization phase when the winsorized binning strategy is selected. It provides a more robust alternative to uniform binning by reducing the influence of extreme outliers on bin boundary placement.
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File: shared/libebm/CutWinsorized.cpp
Signature
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION CutWinsorized(
IntEbm countSamples,
const double* featureVals,
IntEbm* countCutsInOut,
double* cutsLowerBoundInclusiveOut);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| countSamples | IntEbm | Yes | Number of feature value samples |
| featureVals | const double* | Yes | Array of feature values (may contain NaN for missing) |
| countCutsInOut | IntEbm* | Yes | Pointer to the desired number of cuts (updated on output) |
| cutsLowerBoundInclusiveOut | double* | Yes | Output buffer for cut points |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | ErrorEbm | Error code (Error_None on success) |
| countCutsInOut | IntEbm* | Updated with the actual number of cuts placed |
| cutsLowerBoundInclusiveOut | double* | Array of lower-bound-inclusive cut points |
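To illustrate what lower-bound-inclusive means for the returned cut points, the following sketch assigns values to bins with the standard-library bisect module. The helper name bin_index_sketch and the sample cut values are hypothetical, and the way the real pipeline treats missing values is not shown.

```python
from bisect import bisect_right

def bin_index_sketch(cuts, value):
    """Index of the bin holding value, given sorted lower-bound-inclusive cuts."""
    # bisect_right counts the cuts that are <= value, so a value equal to a cut
    # falls into the bin whose lower bound is that cut.
    return bisect_right(cuts, value)

cuts = [20.0, 40.0, 60.0, 80.0]
indices = [bin_index_sketch(cuts, v) for v in (5.0, 20.0, 39.9, 80.0, 1e12)]
```

Four cuts define five bins, so the indices here range from 0 to 4. The value 20.0 lands in bin 1 rather than bin 0 because it equals the lower bound of bin 1.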
Usage Examples
Pipeline Context
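# This C++ module is called internally via the native bindings
# during preprocessing when winsorized binning is selected
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)  # any numeric dataset works here
# Assumption: passing "winsorized" as a feature type selects this binning strategy
ebm = ExplainableBoostingClassifier(feature_types=["winsorized"] * X.shape[1])
ebm.fit(X, y)  # Discretization of winsorized features goes through CutWinsorized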