Implementation:Interpretml Interpret DataSetInteraction
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, EBM_Core |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
DataSetInteraction is a C++ module that manages dataset initialization and memory layout for the EBM interaction detection phase.
Description
This module implements the DataSetInteraction and DataSubsetInteraction classes which manage data required during interaction detection. Similar to DataSetBoosting, the dataset is divided into subsets to support both CPU and SIMD processing pipelines with different numeric precision levels.
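The subset split described above can be sketched as follows. This is a minimal illustration, not the library's code: `SubsetPlan`, `PlanSubsets`, and `cSimdPack` are hypothetical names, and the rule "round a chunk down to whole SIMD packs, send the remainder to a CPU subset" is an assumption about how the two pipelines share the samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one subset; not the library's DataSubsetInteraction.
struct SubsetPlan {
   std::size_t cSamples; // samples assigned to this subset
   bool bSimd;           // true if the subset would use the SIMD pipeline
};

// Split cSamples into subsets of at most cSubsetItemsMax samples.
// A chunk holding at least one full SIMD pack is rounded down to a multiple
// of cSimdPack and marked SIMD; whatever is left over ends up in a CPU subset.
// cSimdPack == 1 is taken to mean that no SIMD pipeline is available.
inline std::vector<SubsetPlan> PlanSubsets(
      std::size_t cSamples, std::size_t cSubsetItemsMax, std::size_t cSimdPack) {
   assert(cSubsetItemsMax >= 1 && cSimdPack >= 1);
   std::vector<SubsetPlan> plans;
   while(cSamples != 0) {
      std::size_t cChunk = std::min(cSamples, cSubsetItemsMax);
      bool bSimd = false;
      if(cSimdPack > 1 && cChunk >= cSimdPack) {
         cChunk -= cChunk % cSimdPack; // keep only whole SIMD packs
         bSimd = true;
      }
      plans.push_back({cChunk, bSimd});
      cSamples -= cChunk;
   }
   return plans;
}
```

Under these assumptions, `PlanSubsets(21, 16, 8)` yields one SIMD subset of 16 samples followed by one CPU subset of 5 samples.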
Key responsibilities include:
- Gradient/Hessian allocation (InitGradHess): Allocates aligned memory for the gradient and hessian arrays of each data subset.
- Feature data initialization (InitFeatureData): Bit-packs individual feature bin indices from the shared dataset format into SIMD-aligned integer arrays. Unlike boosting (which packs multi-dimensional term tensor indices), interaction detection packs each feature independently since interaction terms are evaluated dynamically.
- Weight initialization (InitWeights): Extracts sample weights from the shared dataset, computes total weight across all subsets, and validates against overflow.
- Top-level initialization (InitDataSetInteraction): Orchestrates subset creation, assigns CPU vs SIMD objective wrappers based on subset size, and calls all sub-initialization routines.
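To make the per-feature bit-packing step concrete, here is a minimal sketch that packs the bin indices of a single feature into 64-bit words. The helper names (`BitsRequired`, `PackFeature`, `UnpackItem`) are hypothetical, and the low-bits-first ordering is an assumption; the real InitFeatureData may order items differently and also interleaves data for SIMD lanes, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to store any bin index in [0, cBins - 1].
inline unsigned BitsRequired(std::uint64_t cBins) {
   assert(cBins >= 1);
   unsigned cBits = 1;
   while(cBits < 64 && ((cBins - 1) >> cBits) != 0) ++cBits;
   return cBits;
}

// Pack the bin indices of one feature into 64-bit words, low bits first.
inline std::vector<std::uint64_t> PackFeature(
      const std::vector<std::uint64_t>& aBinIndexes, std::uint64_t cBins) {
   const unsigned cBitsPerItem = BitsRequired(cBins);
   const unsigned cItemsPerWord = 64 / cBitsPerItem;
   const std::size_t cWords = (aBinIndexes.size() + cItemsPerWord - 1) / cItemsPerWord;
   std::vector<std::uint64_t> packed(cWords, 0);
   for(std::size_t i = 0; i < aBinIndexes.size(); ++i) {
      assert(aBinIndexes[i] < cBins);
      const unsigned shift = static_cast<unsigned>(i % cItemsPerWord) * cBitsPerItem;
      packed[i / cItemsPerWord] |= aBinIndexes[i] << shift;
   }
   return packed;
}

// Read back the bin index of sample i from the packed words.
inline std::uint64_t UnpackItem(
      const std::vector<std::uint64_t>& packed, std::uint64_t cBins, std::size_t i) {
   const unsigned cBitsPerItem = BitsRequired(cBins);
   const unsigned cItemsPerWord = 64 / cBitsPerItem;
   const std::uint64_t mask =
         cBitsPerItem == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << cBitsPerItem) - 1;
   const unsigned shift = static_cast<unsigned>(i % cItemsPerWord) * cBitsPerItem;
   return (packed[i / cItemsPerWord] >> shift) & mask;
}
```

With 5 bins each index needs 3 bits, so 21 samples fit in one word. Because every feature is packed on its own, an interaction scorer can combine any pair of features at evaluation time without a precomputed tensor index.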
The interaction dataset only includes training samples (positive bag values), as interaction detection operates solely on training data.
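The bag filtering can be illustrated with a short sketch that counts how many training samples a bag selects. `BagValue` and `CountTrainingSamples` are hypothetical stand-ins, and two points are assumptions rather than facts from this page: that a positive bag value is a replication count, and that a null bag means every shared sample is included once.

```cpp
#include <cstddef>
#include <cstdint>

using BagValue = std::int8_t; // stand-in for BagEbm; the real typedef may differ

// Count the training samples selected by a bag.
// Assumption: aBag == nullptr means every shared sample is included exactly once.
// Assumption: a positive entry is the number of times the sample is replicated;
// zero and negative entries contribute nothing to the interaction dataset.
inline std::size_t CountTrainingSamples(const BagValue* aBag, std::size_t cSharedSamples) {
   if(aBag == nullptr) return cSharedSamples;
   std::size_t cIncluded = 0;
   for(std::size_t i = 0; i < cSharedSamples; ++i) {
      if(aBag[i] > 0) cIncluded += static_cast<std::size_t>(aBag[i]);
   }
   return cIncluded;
}
```

Under these assumptions, the bag `{1, -1, 0, 3, 2}` selects 6 training samples, which would correspond to the cIncludedSamples argument.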
Usage
This module is instantiated during InteractionCore::Create at the start of interaction detection. It provides the data layout used by interaction scoring algorithms (PartitionMultiDimensionalStraight, etc.) to evaluate feature interaction strengths.
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File: shared/libebm/DataSetInteraction.cpp
Signature
ErrorEbm DataSetInteraction::InitDataSetInteraction(
const bool bAllocateHessians,
const size_t cScores,
const size_t cSubsetItemsMax,
const ObjectiveWrapper* const pObjectiveCpu,
const ObjectiveWrapper* const pObjectiveSIMD,
const unsigned char* const pDataSetShared,
const size_t cSharedSamples,
const BagEbm* const aBag,
const size_t cIncludedSamples,
const size_t cWeights,
const size_t cFeatures);
void DataSetInteraction::DestructDataSetInteraction(
const size_t cFeatures);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| bAllocateHessians | bool | Yes | Whether to allocate hessian arrays |
| cScores | size_t | Yes | Number of score outputs |
| cSubsetItemsMax | size_t | Yes | Maximum samples per data subset |
| pObjectiveCpu | const ObjectiveWrapper* | Yes | CPU objective function wrapper |
| pObjectiveSIMD | const ObjectiveWrapper* | Yes | SIMD objective function wrapper |
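| pDataSetShared | const unsigned char* | Yes | Shared dataset binary blob |
| cSharedSamples | size_t | Yes | Total number of samples in the shared dataset |
| aBag | const BagEbm* | No | Bag replication array |
| cIncludedSamples | size_t | Yes | Number of training samples to include |
| cWeights | size_t | Yes | Number of weight arrays in the shared dataset |
| cFeatures | size_t | Yes | Number of features in the dataset |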
Outputs
| Name | Type | Description |
|---|---|---|
| return value | ErrorEbm | Error code (Error_None on success) |
| DataSetInteraction members | (internal) | Initialized subsets with gradient, feature data, and weight arrays |
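Initialization failures are reported through the return value. As one illustration, the weight-total validation that InitWeights is described as performing might look like the sketch below. `InitStatus` and `SumWeights` are hypothetical; only Error_None is named on this page, so the failure code here is a placeholder and not a real ErrorEbm value.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for ErrorEbm; Ok plays the role of Error_None.
enum class InitStatus { Ok, BadWeightTotal };

// Sum the sample weights and reject a total that overflowed to infinity
// or became NaN. On success the total is written to *pTotalOut.
inline InitStatus SumWeights(const double* aWeights, std::size_t cSamples, double* pTotalOut) {
   double total = 0.0;
   for(std::size_t i = 0; i < cSamples; ++i) {
      total += aWeights[i];
   }
   if(!std::isfinite(total)) return InitStatus::BadWeightTotal;
   *pTotalOut = total;
   return InitStatus::Ok;
}
```

A caller would check the status before using the total, in the same way callers of InitDataSetInteraction check for Error_None.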
Usage Examples
Pipeline Context
# This C++ module is called internally via the native bindings
# during interaction detection
from sklearn.datasets import load_breast_cancer
from interpret.glassbox import ExplainableBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)  # example data; any tabular dataset works
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # Internally creates DataSetInteraction for interaction scoring