Implementation:Interpretml Interpret DatasetShared
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, EBM_Core |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
DatasetShared is a C++ module that implements the binary serialization format for sharing dataset information between the Python/R API layer and the C++ boosting/interaction engine.
Description
This module defines the shared dataset format: a compact binary representation that packages features, weights, and targets into a single contiguous memory block. The format uses UIntShared (32-bit) integers for all metadata to ensure cross-platform compatibility.
The module is organized into two main groups of functions:
Building functions (Measure + Fill pattern):
- MeasureDataSetHeader/FillDataSetHeader: Compute size and write the header containing sample/feature/weight/target counts
- MeasureFeature/FillFeature: Compute size and write binned feature data with metadata flags (missing, unseen, nominal, sparse)
- MeasureWeight/FillWeight: Compute size and write sample weight data
- MeasureClassificationTarget/FillClassificationTarget: Handle classification targets (integer class indices)
- MeasureRegressionTarget/FillRegressionTarget: Handle regression targets (floating-point values)
- CheckDataSet: Validates a completed shared dataset for internal consistency
Reading functions:
- GetDataSetSharedHeader: Extracts header metadata from the binary blob
- GetDataSetSharedFeature: Extracts feature metadata and data pointers
- GetDataSetSharedWeight: Extracts weight data pointers
- GetDataSetSharedTarget: Extracts target data pointers
- ExtractDataSetHeader: Public API to extract header info
- ExtractNominals: Extracts nominal feature flags
- ExtractBinCounts: Extracts bin count per feature
- ExtractTargetClasses: Extracts class counts per target
The binary format uses a header with offsets to each section (features, weights, targets), with magic IDs for validation (k_sharedDataSetWorkingId, k_sharedDataSetDoneId).
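To make the header-plus-offsets idea concrete, here is a minimal sketch of a toy blob that starts with an ID and counts, followed by one offset slot per section. Everything in it is an assumption made for illustration: the names (ToyHeader, ToyCreate, ToySetOffset, ToyGetOffset, ToyMarkDone, ToyReadHeader), the ID values, and the field order are hypothetical and are not taken from dataset_shared.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins; the real constants and layout live in dataset_shared.cpp.
using ToyUInt = std::uint32_t;             // fixed-width integer for all toy metadata
constexpr ToyUInt kToyWorkingId = 0x1111;  // blob is still being filled
constexpr ToyUInt kToyDoneId = 0x2222;     // blob is complete and safe to read

// Toy header: an ID and three counts, followed in memory by one offset per section.
struct ToyHeader {
    ToyUInt id;
    ToyUInt countFeatures;
    ToyUInt countWeights;
    ToyUInt countTargets;
};

// Create a blob in the "working" state with every section offset zeroed.
inline std::vector<unsigned char> ToyCreate(ToyUInt cFeatures, ToyUInt cWeights, ToyUInt cTargets) {
    const std::size_t cSections = static_cast<std::size_t>(cFeatures) + cWeights + cTargets;
    std::vector<unsigned char> blob(sizeof(ToyHeader) + cSections * sizeof(ToyUInt), 0);
    const ToyHeader header{kToyWorkingId, cFeatures, cWeights, cTargets};
    std::memcpy(blob.data(), &header, sizeof(header));
    return blob;
}

// Record the byte offset at which section iSection starts.
inline void ToySetOffset(std::vector<unsigned char>& blob, std::size_t iSection, ToyUInt offset) {
    const std::size_t pos = sizeof(ToyHeader) + iSection * sizeof(ToyUInt);
    assert(pos + sizeof(ToyUInt) <= blob.size());
    std::memcpy(blob.data() + pos, &offset, sizeof(offset));
}

// Read back the byte offset of section iSection.
inline ToyUInt ToyGetOffset(const std::vector<unsigned char>& blob, std::size_t iSection) {
    const std::size_t pos = sizeof(ToyHeader) + iSection * sizeof(ToyUInt);
    assert(pos + sizeof(ToyUInt) <= blob.size());
    ToyUInt offset = 0;
    std::memcpy(&offset, blob.data() + pos, sizeof(offset));
    return offset;
}

// Flip the ID from "working" to "done" once all sections are written.
inline void ToyMarkDone(std::vector<unsigned char>& blob) {
    assert(blob.size() >= sizeof(ToyUInt));
    std::memcpy(blob.data(), &kToyDoneId, sizeof(kToyDoneId));
}

// Reader side: copy the header out and accept it only if the ID says "done".
inline bool ToyReadHeader(const std::vector<unsigned char>& blob, ToyHeader& out) {
    if (blob.size() < sizeof(ToyHeader)) return false;
    std::memcpy(&out, blob.data(), sizeof(out));
    return out.id == kToyDoneId;
}
```

In this sketch a reader rejects any blob whose ID is still the working value, which is one plausible reason to keep two distinct IDs: a partially filled buffer cannot be mistaken for a finished dataset.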
Usage
The Python API constructs the shared dataset in two passes: it calls the Measure functions to compute the required sizes, allocates a buffer, then calls the Fill functions to populate it. The resulting binary blob is passed to BoosterCore::Create or InteractionCore::Create, which read it using the Get functions.
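The measure, allocate, fill sequence can be sketched with a single toy section. The helpers below (ToyMeasureWeights, ToyFillWeights, ToyBuildWeights) are hypothetical and only mimic the shape of the Measure/Fill signatures; the byte layout and the error values are assumptions for illustration, not the library's actual format or error codes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers that mimic the Measure/Fill split; they are not libebm functions.

// Pass 1: report how many bytes a toy weight section needs (a count plus the payload).
inline std::int64_t ToyMeasureWeights(std::int64_t countSamples) {
    if (countSamples < 0) return -1;  // toy error signal
    return static_cast<std::int64_t>(sizeof(std::uint32_t)) +
           countSamples * static_cast<std::int64_t>(sizeof(double));
}

// Pass 2: write the section into caller-owned memory, refusing to overrun it.
// Returns 0 on success and a nonzero toy error code otherwise.
inline int ToyFillWeights(std::int64_t countSamples, const double* weights,
                          std::int64_t countBytesAllocated, void* fillMem) {
    const std::int64_t needed = ToyMeasureWeights(countSamples);
    if (needed < 0 || fillMem == nullptr || countBytesAllocated < needed) return 1;
    if (countSamples > 0 && weights == nullptr) return 1;
    auto* dst = static_cast<unsigned char*>(fillMem);
    const auto count = static_cast<std::uint32_t>(countSamples);
    std::memcpy(dst, &count, sizeof(count));
    if (countSamples > 0) {
        std::memcpy(dst + sizeof(count), weights,
                    static_cast<std::size_t>(countSamples) * sizeof(double));
    }
    return 0;
}

// Caller side: measure first, allocate exactly that much, then fill.
inline std::vector<unsigned char> ToyBuildWeights(const std::vector<double>& weights) {
    const auto n = static_cast<std::int64_t>(weights.size());
    const std::int64_t bytes = ToyMeasureWeights(n);
    std::vector<unsigned char> blob(static_cast<std::size_t>(bytes));
    const int err = ToyFillWeights(n, weights.data(), bytes, blob.data());
    assert(err == 0);
    (void)err;  // silence unused-variable warnings when assertions are disabled
    return blob;
}
```

Splitting the work this way keeps allocation on the caller's side, so a host language such as Python can own the buffer while the native code only writes into it and checks countBytesAllocated before doing so.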
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File: shared/libebm/dataset_shared.cpp
Signature
// Building functions
EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureDataSetHeader(
    IntEbm countFeatures, IntEbm countWeights, IntEbm countTargets);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillDataSetHeader(
    IntEbm countFeatures, IntEbm countWeights, IntEbm countTargets,
    IntEbm countBytesAllocated, void* fillMem);
EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureFeature(
    IntEbm countBins, BoolEbm isMissing, BoolEbm isUnseen,
    BoolEbm isNominal, IntEbm countSamples, const IntEbm* binIndexes);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillFeature(
    IntEbm countBins, BoolEbm isMissing, BoolEbm isUnseen,
    BoolEbm isNominal, IntEbm countSamples, const IntEbm* binIndexes,
    IntEbm countBytesAllocated, void* fillMem);
EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureWeight(
    IntEbm countSamples, const double* weights);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillWeight(
    IntEbm countSamples, const double* weights,
    IntEbm countBytesAllocated, void* fillMem);
EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureClassificationTarget(
    IntEbm countClasses, IntEbm countSamples, const IntEbm* targets);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillClassificationTarget(
    IntEbm countClasses, IntEbm countSamples, const IntEbm* targets,
    IntEbm countBytesAllocated, void* fillMem);
// Validation and extraction functions
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION CheckDataSet(
    IntEbm countBytesAllocated, const void* dataSet);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION ExtractDataSetHeader(
    const void* dataSet, IntEbm* countSamplesOut,
    IntEbm* countFeaturesOut, IntEbm* countWeightsOut,
    IntEbm* countTargetsOut);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION ExtractBinCounts(
    const void* dataSet, IntEbm countFeaturesVerify, IntEbm* binCountsOut);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION ExtractTargetClasses(
    const void* dataSet, IntEbm countTargetsVerify, IntEbm* classCountsOut);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| countFeatures | IntEbm | Yes | Number of features in the dataset |
| countWeights | IntEbm | Yes | Number of weight columns (0 or 1) |
| countTargets | IntEbm | Yes | Number of target columns |
| countSamples | IntEbm | Yes | Number of samples |
| fillMem | void* | Yes | Pre-allocated buffer for the binary blob |
| countBytesAllocated | IntEbm | Yes | Size of the allocated buffer |
Outputs
| Name | Type | Description |
|---|---|---|
| Measure return | IntEbm | Number of bytes required for the section |
| Fill return | ErrorEbm | Error code (Error_None on success) |
| fillMem | void* (in-place) | Populated binary blob |
Usage Examples
Pipeline Context
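# This C++ module is called internally via the native bindings
# to construct the shared dataset binary blob from Python arrays.
from sklearn.datasets import make_classification
from interpret.glassbox import ExplainableBoostingClassifier

# Illustrative synthetic data; any tabular X and label vector y would do.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # Internally calls the Measure/Fill functions to create the shared dataset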