Implementation:Interpretml Interpret DataSetBoosting
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, EBM_Core |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
DataSetBoosting is a C++ module that manages the initialization, memory allocation, and data layout for training and validation datasets used during the EBM boosting process.
Description
This module implements the DataSetBoosting and DataSubsetBoosting classes, which manage all data required during the boosting phase. The dataset is divided into subsets to support both CPU and SIMD processing, and each subset may use a different numeric precision.
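The division into subsets can be pictured with a small sketch. It is illustrative only: the page does not describe the actual partitioning logic, so the rule used here (full SIMD packs go to SIMD subsets, leftover samples go to a CPU subset, and no subset exceeds cSubsetItemsMax) is an assumption, and SubsetPlan, PlanSubsets, and cSimdPack are hypothetical names that do not come from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one planned subset (not a library type).
struct SubsetPlan {
   std::size_t cSamples; // number of samples assigned to this subset
   bool bSimd;           // true if the subset is meant for the SIMD objective
};

// Assumed rule: samples that fill whole SIMD packs go to SIMD subsets,
// the remainder goes to CPU subsets, and every subset is capped at
// cSubsetItemsMax samples.
inline std::vector<SubsetPlan> PlanSubsets(
      const std::size_t cSamples,
      const std::size_t cSimdPack,
      const std::size_t cSubsetItemsMax) {
   assert(cSimdPack >= 1);
   assert(cSubsetItemsMax >= cSimdPack);

   // Largest SIMD subset size that is a multiple of the pack width.
   const std::size_t cSimdMax = cSubsetItemsMax - cSubsetItemsMax % cSimdPack;

   // A pack width of 1 means no SIMD objective is available in this sketch.
   std::size_t cSimdRemaining = cSimdPack > 1 ? cSamples - cSamples % cSimdPack : 0;
   std::size_t cCpuRemaining = cSamples - cSimdRemaining;

   std::vector<SubsetPlan> plan;
   while(cSimdRemaining != 0) {
      const std::size_t c = cSimdRemaining < cSimdMax ? cSimdRemaining : cSimdMax;
      plan.push_back({c, true});
      cSimdRemaining -= c;
   }
   while(cCpuRemaining != 0) {
      const std::size_t c = cCpuRemaining < cSubsetItemsMax ? cCpuRemaining : cSubsetItemsMax;
      plan.push_back({c, false});
      cCpuRemaining -= c;
   }
   return plan;
}
```

With 21 samples, a pack width of 8, and a cap of 16, this sketch plans one SIMD subset of 16 samples followed by one CPU subset of 5.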
Key responsibilities include:
- Gradient/Hessian allocation (InitGradHess): Allocates aligned memory for gradient and hessian arrays per subset, respecting the objective function's float byte size.
- Sample score initialization (InitSampleScores): Copies initial scores (intercepts plus user-provided init scores) into SIMD-aligned layout with interleaved partition indexing.
- Target data initialization (InitTargetData): Extracts target values from the shared dataset, handling both classification (integer targets) and regression (float targets).
- Term data initialization (InitTermData): Bit-packs feature bin indices for each interaction term into SIMD-aligned integer arrays, supporting multi-dimensional tensor indexing.
- Weight copying (CopyWeights): Extracts sample weights from the shared dataset.
- Inner bag initialization (InitBags): Implements bagging (bootstrap sampling with replacement) for inner bags, computing per-bag bin counts and weights for each term.
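The bit-packing idea behind the term data step can be sketched as follows. This is a minimal illustration rather than the library's layout: it packs into plain 64-bit words, puts the first item in the lowest bits, and ignores SIMD interleaving. All function names (BitsRequired, PackBins, UnpackBin, TensorIndex) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest number of bits that can hold every bin index in [0, cBins).
inline unsigned BitsRequired(const std::uint64_t cBins) {
   assert(cBins >= 1);
   std::uint64_t maxIndex = cBins - 1;
   unsigned cBits = 0;
   while(maxIndex != 0) {
      ++cBits;
      maxIndex >>= 1;
   }
   return cBits == 0 ? 1 : cBits; // always reserve at least one bit
}

// Pack bin indices into 64-bit words, lowest bits first (arbitrary choice).
inline std::vector<std::uint64_t> PackBins(
      const std::vector<std::uint64_t>& bins, const unsigned cBitsPerItem) {
   assert(cBitsPerItem >= 1 && cBitsPerItem <= 64);
   const std::size_t cItemsPerWord = 64 / cBitsPerItem;
   std::vector<std::uint64_t> packed((bins.size() + cItemsPerWord - 1) / cItemsPerWord, 0);
   for(std::size_t i = 0; i < bins.size(); ++i) {
      const unsigned shift = static_cast<unsigned>(i % cItemsPerWord) * cBitsPerItem;
      packed[i / cItemsPerWord] |= bins[i] << shift;
   }
   return packed;
}

// Read back the i-th bin index from the packed words.
inline std::uint64_t UnpackBin(
      const std::vector<std::uint64_t>& packed, const std::size_t i, const unsigned cBitsPerItem) {
   assert(cBitsPerItem >= 1 && cBitsPerItem <= 64);
   const std::size_t cItemsPerWord = 64 / cBitsPerItem;
   const std::uint64_t mask =
         cBitsPerItem == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << cBitsPerItem) - 1;
   const unsigned shift = static_cast<unsigned>(i % cItemsPerWord) * cBitsPerItem;
   return (packed[i / cItemsPerWord] >> shift) & mask;
}

// Flatten the per-feature bin indices of a two-feature term into a single
// tensor index, with the first feature varying fastest.
inline std::uint64_t TensorIndex(
      const std::uint64_t iBin0, const std::uint64_t iBin1, const std::uint64_t cBins0) {
   assert(iBin0 < cBins0);
   return iBin0 + cBins0 * iBin1;
}
```

For a term over two features, the flattened TensorIndex value would be the item that gets packed, so the bit width is chosen from the product of the two bin counts.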
All initialization methods handle the bag replication protocol, where positive replication values indicate training samples and negative values indicate validation samples.
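A short sketch of how such a bag array could be interpreted is shown here. It assumes the magnitude of each entry is a replication count, that zero means the sample is excluded, and that a direction of +1 selects training while -1 selects validation. BagValue is a hypothetical stand-in for BagEbm, and the helper names are illustrative, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for BagEbm (assumed to be a small signed integer).
using BagValue = std::int8_t;

struct BagCounts {
   std::size_t cTraining;   // training samples, counting replications
   std::size_t cValidation; // validation samples, counting replications
};

// Tally how many samples the bag contributes to each set. Positive entries
// replicate a sample into training, negative entries into validation, and
// zero is treated as excluded (an assumption of this sketch).
inline BagCounts CountBag(const std::vector<BagValue>& bag) {
   BagCounts counts{0, 0};
   for(const BagValue replication : bag) {
      if(replication > 0) {
         counts.cTraining += static_cast<std::size_t>(replication);
      } else if(replication < 0) {
         counts.cValidation += static_cast<std::size_t>(-static_cast<int>(replication));
      }
   }
   return counts;
}

// Number of copies of one sample that land in the set selected by
// direction (+1 for training, -1 for validation).
inline std::size_t Replications(const BagValue replication, const int direction) {
   assert(direction == 1 || direction == -1);
   const int signedCount = static_cast<int>(replication) * direction;
   return signedCount > 0 ? static_cast<std::size_t>(signedCount) : 0;
}
```

Multiplying by the direction lets one loop serve both sets: an initialization routine can skip any sample whose Replications value is zero and copy the others that many times.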
Usage
This module is instantiated at the beginning of the boosting process (during BoosterCore::Create) and persists for the entire boosting session. It provides the data layout that all boosting operations (bin summing, gradient computation, split finding) operate on.
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File: shared/libebm/DataSetBoosting.cpp
Signature
ErrorEbm DataSetBoosting::InitDataSetBoosting(
const bool bAllocateGradients,
const bool bAllocateHessians,
const bool bAllocateSampleScores,
const bool bAllocateTargetData,
const bool bAllocateCachedTensors,
void* const rng,
const size_t cScores,
const size_t cSubsetItemsMax,
const ObjectiveWrapper* const pObjectiveCpu,
const ObjectiveWrapper* const pObjectiveSIMD,
const unsigned char* const pDataSetShared,
const double* const aIntercept,
const BagEbm direction,
const size_t cSharedSamples,
const BagEbm* const aBag,
const double* const aInitScores,
const size_t cIncludedSamples,
const size_t cInnerBags,
const size_t cWeights,
const size_t cTerms,
const Term* const* const apTerms,
const IntEbm* const aiTermFeatures);
void DataSetBoosting::DestructDataSetBoosting(
const size_t cTerms,
const size_t cInnerBags);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| bAllocateGradients | bool | Yes | Whether to allocate gradient arrays |
| bAllocateHessians | bool | Yes | Whether to allocate hessian arrays |
| cScores | size_t | Yes | Number of score outputs (1 for regression/binary, N for multiclass) |
| cSubsetItemsMax | size_t | Yes | Maximum samples per data subset (for 32-bit overflow prevention) |
| pObjectiveCpu | const ObjectiveWrapper* | Yes | CPU objective function wrapper |
| pObjectiveSIMD | const ObjectiveWrapper* | Yes | SIMD objective function wrapper |
| pDataSetShared | const unsigned char* | Yes | Shared dataset binary blob |
| aBag | const BagEbm* | No | Bag replication array (positive=train, negative=validation) |
| cInnerBags | size_t | Yes | Number of inner bags for bagging |
| apTerms | const Term* const* | Yes | Array of term pointers defining interaction terms |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | ErrorEbm | Error code (Error_None on success) |
| DataSetBoosting members | (internal) | Initialized subsets with gradient, score, target, and term data arrays |
Usage Examples
Pipeline Context
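# This C++ module is called internally via the native bindings
# when creating the boosting core during fit()
from sklearn.datasets import make_classification
from interpret.glassbox import ExplainableBoostingClassifier

# Small synthetic dataset so the example runs end to end
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # Internally creates DataSetBoosting for train/validation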