Implementation:Interpretml Interpret DataSetBoosting

Knowledge Sources	Interpretml Interpret
Domains	Machine_Learning, EBM_Core
Last Updated	2026-02-07 12:00 GMT

Overview

DataSetBoosting is a C++ module that manages the initialization, memory allocation, and data layout for training and validation datasets used during the EBM boosting process.

Description

This module implements the DataSetBoosting and DataSubsetBoosting classes, which manage all data required during the boosting phase. The dataset is divided into subsets so that it can be processed by both CPU and SIMD code paths, and each subset may use a different numeric precision.
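To make the idea of splitting a dataset into CPU and SIMD subsets concrete, here is a minimal sketch in Python. It is not the library's algorithm: the function name plan_subsets, the simd_pack parameter, and the rule "SIMD subsets hold a multiple of the pack width, any leftover goes to a CPU subset" are assumptions made for illustration. Only the cap on subset size corresponds to the cSubsetItemsMax parameter documented on this page.

```python
def plan_subsets(c_samples, simd_pack, c_subset_items_max):
    """Hypothetical planner: split c_samples into ("simd"|"cpu", size) subsets.

    Assumptions (not taken from the libebm source):
      * a SIMD subset must hold a multiple of simd_pack samples
      * no subset may exceed c_subset_items_max samples
      * whatever does not fill a SIMD pack is handled by one CPU subset
    """
    if c_samples < 0 or simd_pack < 1 or c_subset_items_max < simd_pack:
        raise ValueError("invalid sizes")

    # Largest subset size that is both under the cap and a multiple of the pack.
    simd_cap = c_subset_items_max - c_subset_items_max % simd_pack

    subsets = []
    remaining = c_samples
    while remaining >= simd_pack:
        size = min(remaining - remaining % simd_pack, simd_cap)
        subsets.append(("simd", size))
        remaining -= size
    if remaining > 0:
        # Fewer than simd_pack samples are left, so this is under the cap too.
        subsets.append(("cpu", remaining))
    return subsets


if __name__ == "__main__":
    print(plan_subsets(21, simd_pack=8, c_subset_items_max=16))
```

In this sketch every sample lands in exactly one subset, so the subset sizes always sum to the input count.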

Key responsibilities include:

Gradient/Hessian allocation (InitGradHess): Allocates aligned memory for gradient and hessian arrays per subset, respecting the objective function's float byte size.
Sample score initialization (InitSampleScores): Copies initial scores (intercepts plus user-provided init scores) into SIMD-aligned layout with interleaved partition indexing.
Target data initialization (InitTargetData): Extracts target values from the shared dataset, handling both classification (integer targets) and regression (float targets).
Term data initialization (InitTermData): Bit-packs the feature bin indices of each term (a single feature or a combination of features) into SIMD-aligned integer arrays, supporting multi-dimensional tensor indexing.
Weight copying (CopyWeights): Extracts sample weights from the shared dataset.
Inner bag initialization (InitBags): Implements bagging (bootstrap sampling with replacement) for inner bags, computing per-bag bin counts and weights for each term.
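The bit-packing and tensor indexing mentioned for InitTermData can be illustrated with a small sketch. It shows the general technique only: per-feature bin indices are flattened into one tensor cell index, and several such indices are stored in one fixed-width integer. The function names, the "first feature varies fastest" ordering, the low-bits-first slot order, and the default 64-bit word are assumptions for illustration; the actual SIMD-interleaved layout used by the library is not reproduced here.

```python
def tensor_index(bin_indices, bin_counts):
    """Flatten per-feature bin indices into a single tensor cell index.

    Assumed ordering: the first feature varies fastest.
    """
    index = 0
    stride = 1
    for i, c in zip(bin_indices, bin_counts):
        if not 0 <= i < c:
            raise ValueError("bin index out of range")
        index += i * stride
        stride *= c
    return index


def pack(values, c_cells, word_bits=64):
    """Pack values in [0, c_cells) into words; returns (words, bits_per_item)."""
    bits = max(1, (c_cells - 1).bit_length())  # bits needed for the largest index
    per_word = word_bits // bits               # how many items fit in one word
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, v in enumerate(values[start:start + per_word]):
            word |= v << (slot * bits)         # assumed slot order: low bits first
        words.append(word)
    return words, bits


def unpack(words, bits, count, word_bits=64):
    """Recover the first `count` packed items."""
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    return [(words[i // per_word] >> ((i % per_word) * bits)) & mask
            for i in range(count)]


if __name__ == "__main__":
    cell = tensor_index((2, 1), (4, 3))  # a 4 x 3 term has 12 cells
    words, bits = pack([cell, 11, 0, 3, 7], c_cells=12, word_bits=16)
    print(cell, bits, words, unpack(words, bits, 5, word_bits=16))
```

Packing reduces memory traffic because a term with few bins needs only a few bits per sample, at the cost of a shift and mask when each index is read back.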

All initialization methods handle the bag replication protocol, where positive replication values indicate training samples and negative values indicate validation samples.
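The sign convention of the bag replication array can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, the magnitude of each value is assumed to be the number of times the sample is replicated, and a value of zero is assumed to exclude the sample from both sets. The direction argument mirrors the direction parameter in the signature on this page, with +1 assumed to select training samples and -1 validation samples.

```python
def count_included(bag, direction):
    """Total replicated samples on one side of the bag.

    direction: +1 selects training entries (positive values),
               -1 selects validation entries (negative values).
    Assumption: the magnitude is the replication count, 0 means excluded.
    """
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    return sum(abs(r) for r in bag if r * direction > 0)


def expand_indices(bag, direction):
    """Indices into the shared dataset, repeated once per replication."""
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    out = []
    for i, r in enumerate(bag):
        if r * direction > 0:
            out.extend([i] * abs(r))
    return out


if __name__ == "__main__":
    bag = [1, -1, 2, 0, -3, 1]
    print(count_included(bag, 1), expand_indices(bag, 1))
    print(count_included(bag, -1), expand_indices(bag, -1))
```

Under these assumptions, a single pass over the bag is enough for an initialization routine to decide, for each shared sample, whether to skip it or how many copies to write into the training or validation layout.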

Usage

This module is instantiated at the beginning of the boosting process (during BoosterCore::Create) and persists for the entire boosting session. It provides the data layout that all boosting operations (bin summing, gradient computation, split finding) operate on.

Code Reference

Source Location

Repository: Interpretml_Interpret
File: shared/libebm/DataSetBoosting.cpp

Signature

ErrorEbm DataSetBoosting::InitDataSetBoosting(
    const bool bAllocateGradients,
    const bool bAllocateHessians,
    const bool bAllocateSampleScores,
    const bool bAllocateTargetData,
    const bool bAllocateCachedTensors,
    void* const rng,
    const size_t cScores,
    const size_t cSubsetItemsMax,
    const ObjectiveWrapper* const pObjectiveCpu,
    const ObjectiveWrapper* const pObjectiveSIMD,
    const unsigned char* const pDataSetShared,
    const double* const aIntercept,
    const BagEbm direction,
    const size_t cSharedSamples,
    const BagEbm* const aBag,
    const double* const aInitScores,
    const size_t cIncludedSamples,
    const size_t cInnerBags,
    const size_t cWeights,
    const size_t cTerms,
    const Term* const* const apTerms,
    const IntEbm* const aiTermFeatures);

void DataSetBoosting::DestructDataSetBoosting(
    const size_t cTerms,
    const size_t cInnerBags);

I/O Contract

Inputs

Name	Type	Required	Description
bAllocateGradients	bool	Yes	Whether to allocate gradient arrays
bAllocateHessians	bool	Yes	Whether to allocate hessian arrays
cScores	size_t	Yes	Number of score outputs (1 for regression/binary, N for multiclass)
cSubsetItemsMax	size_t	Yes	Maximum samples per data subset (for 32-bit overflow prevention)
pObjectiveCpu	const ObjectiveWrapper*	Yes	CPU objective function wrapper
pObjectiveSIMD	const ObjectiveWrapper*	Yes	SIMD objective function wrapper
pDataSetShared	const unsigned char*	Yes	Shared dataset binary blob
aBag	const BagEbm*	No	Bag replication array (positive=train, negative=validation)
cInnerBags	size_t	Yes	Number of inner bags for bagging
apTerms	const Term* const*	Yes	Array of pointers to the terms (single features or feature interactions) being boosted

Outputs

Name	Type	Description
return value	ErrorEbm	Error code (Error_None on success)
DataSetBoosting members	(internal)	Initialized subsets with gradient, score, target, and term data arrays

Usage Examples

Pipeline Context

# This C++ module is called internally via the native bindings
# when creating the boosting core during fit()
from sklearn.datasets import load_breast_cancer
from interpret.glassbox import ExplainableBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)  # example data; any tabular dataset works
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # Internally creates DataSetBoosting for train/validation

Related Pages

Environment:Interpretml_Interpret_Native_Libebm_Environment
