Implementation:Interpretml Interpret DatasetShared

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, EBM_Core
Last Updated: 2026-02-07 12:00 GMT

Overview

DatasetShared is a C++ module that implements the binary serialization format for sharing dataset information between the Python/R API layer and the C++ boosting/interaction engine.

Description

This module defines the shared dataset format -- a compact binary representation that packages features, weights, and targets into a single contiguous memory block. The format uses UIntShared (32-bit) integers for all metadata to ensure cross-platform compatibility.

The module is organized into two main groups of functions:

Building functions (Measure + Fill pattern):

  • MeasureDataSetHeader / FillDataSetHeader: Compute size and write the header containing sample/feature/weight/target counts
  • MeasureFeature / FillFeature: Compute size and write binned feature data with metadata flags (missing, unseen, nominal, sparse)
  • MeasureWeight / FillWeight: Compute size and write sample weight data
  • MeasureClassificationTarget / FillClassificationTarget: Handle classification targets (integer class indices)
  • MeasureRegressionTarget / FillRegressionTarget: Handle regression targets (floating-point values)
  • CheckDataSet: Validates a completed shared dataset for internal consistency
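The Measure + Fill pairing can be illustrated with a minimal Python sketch. It is not the library's code: the function names, the section layout (a 32-bit sample count followed by 64-bit doubles), and the error codes are all assumptions made for illustration, and the explicit offset argument is a simplification of however the real Fill functions locate the next free position.

```python
import struct

ERROR_NONE = 0              # stand-in for Error_None
ERROR_BUFFER_TOO_SMALL = 1  # hypothetical error code for this sketch

def measure_weights(weights):
    """Pass 1: return the number of bytes the weight section needs."""
    # Assumed layout: 4-byte sample count + one 8-byte double per sample.
    return 4 + 8 * len(weights)

def fill_weights(weights, bytes_allocated, buf, offset=0):
    """Pass 2: write the section into buf and return an error code."""
    needed = measure_weights(weights)
    # Refuse to write past the space the caller says it allocated.
    if bytes_allocated > len(buf) or offset + needed > bytes_allocated:
        return ERROR_BUFFER_TOO_SMALL
    struct.pack_into("<I", buf, offset, len(weights))
    struct.pack_into(f"<{len(weights)}d", buf, offset + 4, *weights)
    return ERROR_NONE

weights = [1.0, 0.5, 2.0]
size = measure_weights(weights)        # measure first
buf = bytearray(size)                  # the caller allocates exactly once
err = fill_weights(weights, size, buf) # then fill the pre-sized buffer

# A buffer one byte too small is rejected rather than overrun.
too_small = fill_weights(weights, size - 1, bytearray(size - 1))
```

Separating the two passes lets the caller own the allocation, so the writer never has to grow or reallocate memory while it serializes.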

Reading functions:

  • GetDataSetSharedHeader: Extracts header metadata from the binary blob
  • GetDataSetSharedFeature: Extracts feature metadata and data pointers
  • GetDataSetSharedWeight: Extracts weight data pointers
  • GetDataSetSharedTarget: Extracts target data pointers
  • ExtractDataSetHeader: Public API to extract header info
  • ExtractNominals: Extracts nominal feature flags
  • ExtractBinCounts: Extracts bin count per feature
  • ExtractTargetClasses: Extracts class counts per target
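As a rough illustration of how per-feature metadata could be read back, the Python sketch below packs the four flags into a bit field next to a bin count and then extracts bin counts and nominal flags. The flag values, the record layout, the 32-bit field width, and the helper names are assumptions for this example only; the real constants and layout are defined in the C++ source. The count_features_verify argument mirrors the countFeaturesVerify parameter of the Extract functions and is treated here as a consistency check.

```python
import struct

# Hypothetical flag bits and record layout, chosen for this sketch only.
FLAG_MISSING, FLAG_UNSEEN, FLAG_NOMINAL, FLAG_SPARSE = 0x1, 0x2, 0x4, 0x8
RECORD = struct.Struct("<II")  # (flags, bin count) per feature

def pack_feature_meta(count_bins, is_missing, is_unseen, is_nominal, is_sparse=False):
    """Encode one feature's metadata as a flags word plus a bin count."""
    flags = 0
    if is_missing:
        flags |= FLAG_MISSING
    if is_unseen:
        flags |= FLAG_UNSEEN
    if is_nominal:
        flags |= FLAG_NOMINAL
    if is_sparse:
        flags |= FLAG_SPARSE
    return RECORD.pack(flags, count_bins)

def _records(blob, count_features_verify):
    """Decode every record, checking the caller's expected feature count."""
    count, remainder = divmod(len(blob), RECORD.size)
    if remainder or count != count_features_verify:
        raise ValueError("feature count does not match the dataset")
    return [RECORD.unpack_from(blob, i * RECORD.size) for i in range(count)]

def extract_bin_counts(blob, count_features_verify):
    return [bins for _, bins in _records(blob, count_features_verify)]

def extract_nominals(blob, count_features_verify):
    return [bool(flags & FLAG_NOMINAL) for flags, _ in _records(blob, count_features_verify)]

# Two features: a 4-bin feature with a missing bin, and a 3-bin nominal feature.
blob = pack_feature_meta(4, True, False, False) + pack_feature_meta(3, False, False, True)
bin_counts = extract_bin_counts(blob, 2)
nominals = extract_nominals(blob, 2)
```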

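The binary format begins with a header that stores offsets to each section (features, weights, targets). Magic IDs (k_sharedDataSetWorkingId, k_sharedDataSetDoneId) are used for validation; their names suggest that one marks a dataset still under construction and the other a completed one.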
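The idea of a header holding an offset table and a magic ID can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the ID values, the field order, the 32-bit widths, and the rule that the ID is switched from "working" to "done" only after the last section is written. The actual layout is defined in the C++ source.

```python
import struct

# Hypothetical stand-ins for k_sharedDataSetWorkingId / k_sharedDataSetDoneId.
WORKING_ID = 0x0000A5A5
DONE_ID = 0x0000D04E

def build(sections, count_samples):
    """Pack section payloads behind a header: id, sample count, one offset each."""
    header_size = 4 * (2 + len(sections))
    buf = bytearray(header_size + sum(len(s) for s in sections))
    struct.pack_into("<II", buf, 0, WORKING_ID, count_samples)
    offset = header_size
    for i, payload in enumerate(sections):
        struct.pack_into("<I", buf, 8 + 4 * i, offset)  # where section i starts
        buf[offset:offset + len(payload)] = payload
        offset += len(payload)
    struct.pack_into("<I", buf, 0, DONE_ID)  # mark complete only at the very end
    return buf

def read_section(buf, index, count_sections):
    """Validate the magic ID, then follow the offset table to one section."""
    (magic,) = struct.unpack_from("<I", buf, 0)
    if magic != DONE_ID:
        raise ValueError("dataset is incomplete or corrupt")
    (start,) = struct.unpack_from("<I", buf, 8 + 4 * index)
    if index + 1 < count_sections:
        (end,) = struct.unpack_from("<I", buf, 8 + 4 * (index + 1))
    else:
        end = len(buf)
    return bytes(buf[start:end])

# Placeholder payloads standing in for feature, weight, and target sections.
sections = [b"feat0", b"w", b"target"]
blob = build(sections, count_samples=3)
weights_section = read_section(blob, 1, len(sections))
```

Because every section is reached through an offset, a reader can jump straight to the targets without decoding the features first, and a blob whose ID was never switched to the "done" value is rejected up front.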

Usage

The Python API constructs the shared dataset in three steps: it calls the Measure functions to compute the size of each section, allocates a single buffer of the total size, and then calls the Fill functions to populate it. The resulting binary blob is passed to BoosterCore::Create or InteractionCore::Create, which read it using the Get functions.

Code Reference

Source Location

Signature

// Building functions
EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureDataSetHeader(
    IntEbm countFeatures, IntEbm countWeights, IntEbm countTargets);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillDataSetHeader(
    IntEbm countFeatures, IntEbm countWeights, IntEbm countTargets,
    IntEbm countBytesAllocated, void* fillMem);

EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureFeature(
    IntEbm countBins, BoolEbm isMissing, BoolEbm isUnseen,
    BoolEbm isNominal, IntEbm countSamples, const IntEbm* binIndexes);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillFeature(
    IntEbm countBins, BoolEbm isMissing, BoolEbm isUnseen,
    BoolEbm isNominal, IntEbm countSamples, const IntEbm* binIndexes,
    IntEbm countBytesAllocated, void* fillMem);

EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureWeight(
    IntEbm countSamples, const double* weights);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillWeight(
    IntEbm countSamples, const double* weights,
    IntEbm countBytesAllocated, void* fillMem);

EBM_API_BODY IntEbm EBM_CALLING_CONVENTION MeasureClassificationTarget(
    IntEbm countClasses, IntEbm countSamples, const IntEbm* targets);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION FillClassificationTarget(
    IntEbm countClasses, IntEbm countSamples, const IntEbm* targets,
    IntEbm countBytesAllocated, void* fillMem);

// Validation and extraction functions
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION CheckDataSet(
    IntEbm countBytesAllocated, const void* dataSet);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION ExtractDataSetHeader(
    const void* dataSet, IntEbm* countSamplesOut,
    IntEbm* countFeaturesOut, IntEbm* countWeightsOut,
    IntEbm* countTargetsOut);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION ExtractBinCounts(
    const void* dataSet, IntEbm countFeaturesVerify, IntEbm* binCountsOut);
EBM_API_BODY ErrorEbm EBM_CALLING_CONVENTION ExtractTargetClasses(
    const void* dataSet, IntEbm countTargetsVerify, IntEbm* classCountsOut);

I/O Contract

Inputs

Name | Type | Required | Description
countFeatures | IntEbm | Yes | Number of features in the dataset
countWeights | IntEbm | Yes | Number of weight columns (0 or 1)
countTargets | IntEbm | Yes | Number of target columns
countSamples | IntEbm | Yes | Number of samples
fillMem | void* | Yes | Pre-allocated buffer for the binary blob
countBytesAllocated | IntEbm | Yes | Size of the allocated buffer in bytes

Outputs

Name | Type | Description
Measure return | IntEbm | Number of bytes required for the section
Fill return | ErrorEbm | Error code (Error_None on success)
fillMem | void* (in-place) | Populated binary blob

Usage Examples

Pipeline Context

# This C++ module is called internally via the native bindings
# to construct the shared dataset binary blob from Python arrays
from sklearn.datasets import load_breast_cancer
from interpret.glassbox import ExplainableBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)  # any tabular dataset works here
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # Internally calls Measure/Fill functions to create shared dataset
