
Implementation:Interpretml Interpret DataSetBoosting

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, EBM_Core
Last Updated: 2026-02-07 12:00 GMT

Overview

DataSetBoosting is a C++ module that manages initialization, memory allocation, and data layout for the training and validation datasets used during Explainable Boosting Machine (EBM) boosting.

Description

This module implements the DataSetBoosting and DataSubsetBoosting classes, which manage all data required during the boosting phase. The dataset is divided into subsets to support both CPU and SIMD processing, and each subset may use a different numeric precision.
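As a rough illustration of the subset split, the sketch below partitions a sample count into consecutive chunks no larger than a maximum size, in the spirit of the cSubsetItemsMax parameter. The helper name split_into_subsets is hypothetical. The real module also chooses between the SIMD and CPU objectives for each subset, which this sketch does not model.

```python
def split_into_subsets(c_samples, c_subset_items_max):
    """Return the sizes of consecutive subsets that together cover c_samples.

    Hypothetical helper for illustration only: every subset holds at most
    c_subset_items_max items, and only the last one may be smaller.
    """
    if c_samples < 0:
        raise ValueError("c_samples must be non-negative")
    if c_subset_items_max <= 0:
        raise ValueError("c_subset_items_max must be positive")
    sizes = []
    remaining = c_samples
    while remaining > 0:
        size = min(remaining, c_subset_items_max)
        sizes.append(size)
        remaining -= size
    return sizes


# Ten samples with at most four per subset
sizes = split_into_subsets(10, 4)
assert sum(sizes) == 10
assert all(size <= 4 for size in sizes)
```

Capping the subset size keeps per-subset counters within a fixed-width integer range, which is the motivation the I/O contract gives for cSubsetItemsMax.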

Key responsibilities include:

  • Gradient/Hessian allocation (InitGradHess): Allocates aligned memory for gradient and hessian arrays per subset, respecting the objective function's float byte size.
  • Sample score initialization (InitSampleScores): Copies initial scores (intercepts plus user-provided init scores) into a SIMD-aligned layout with interleaved partition indexing.
  • Target data initialization (InitTargetData): Extracts target values from the shared dataset, handling both classification (integer targets) and regression (float targets).
  • Term data initialization (InitTermData): Bit-packs feature bin indices for each interaction term into SIMD-aligned integer arrays, supporting multi-dimensional tensor indexing.
  • Weight copying (CopyWeights): Extracts sample weights from the shared dataset.
  • Inner bag initialization (InitBags): Implements bagging (bootstrap sampling with replacement) for inner bags, computing per-bag bin counts and weights for each term.
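The term data step listed above can be pictured as two operations: combining one bin index per feature into a single tensor index, then packing several of those indices into a fixed-width integer. The sketch below illustrates the idea only. All function names are hypothetical, and the choices made here (the first feature varies fastest, 64-bit words, items filled from the low bits upward) are assumptions; the actual layout in the C++ module, including its SIMD interleaving, may differ.

```python
def flatten_bin_indices(bin_indices, bin_counts):
    """Combine one bin index per feature into a single tensor index.

    Assumption for illustration: the first feature varies fastest.
    """
    if len(bin_indices) != len(bin_counts):
        raise ValueError("one bin index is required per feature")
    index = 0
    stride = 1
    for i_bin, c_bins in zip(bin_indices, bin_counts):
        if not 0 <= i_bin < c_bins:
            raise ValueError("bin index out of range")
        index += i_bin * stride
        stride *= c_bins
    return index


def bits_required(max_value):
    """Number of bits needed to store values in 0..max_value (at least 1)."""
    return max(1, max_value.bit_length())


def pack_indices(indices, c_bits_per_item, c_bits_word=64):
    """Pack small non-negative integers into words, low bits first."""
    c_items_per_word = c_bits_word // c_bits_per_item
    mask = (1 << c_bits_per_item) - 1
    words = []
    for start in range(0, len(indices), c_items_per_word):
        word = 0
        chunk = indices[start:start + c_items_per_word]
        for slot, value in enumerate(chunk):
            if not 0 <= value <= mask:
                raise ValueError("value does not fit in c_bits_per_item bits")
            word |= value << (slot * c_bits_per_item)
        words.append(word)
    return words


def unpack_indices(words, c_bits_per_item, c_items, c_bits_word=64):
    """Inverse of pack_indices: recover the first c_items values."""
    c_items_per_word = c_bits_word // c_bits_per_item
    mask = (1 << c_bits_per_item) - 1
    out = []
    for word in words:
        for slot in range(c_items_per_word):
            if len(out) == c_items:
                return out
            out.append((word >> (slot * c_bits_per_item)) & mask)
    return out


# A two-feature term with 3 and 4 bins gives a tensor with 12 cells
bin_counts = (3, 4)
samples = [(0, 0), (2, 3), (1, 2)]
flat = [flatten_bin_indices(s, bin_counts) for s in samples]
c_bits = bits_required(3 * 4 - 1)
packed = pack_indices(flat, c_bits)
assert unpack_indices(packed, c_bits, len(flat)) == flat
```

Packing trades a shift and a mask per lookup for a much smaller memory footprint, which matters because these indices are read for every sample on every boosting step.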

All initialization methods follow the bag replication protocol: in the aBag array, positive replication values mark training samples and negative values mark validation samples.
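A minimal sketch of how such a bag array could be interpreted is shown below. It assumes that the magnitude of each entry is the number of times the sample is replicated, that a zero entry excludes the sample, and that the sign of a direction argument (+1 for training, -1 for validation) selects which set is being built. These details and the helper names are assumptions for illustration, not a description of the C++ code.

```python
def count_included(bag, direction):
    """Count samples, including replications, in the set chosen by direction."""
    if direction not in (1, -1):
        raise ValueError("direction must be +1 (training) or -1 (validation)")
    total = 0
    for replication in bag:
        # Same sign as direction means the sample belongs to this set
        if replication * direction > 0:
            total += abs(replication)
    return total


def expand_indices(bag, direction):
    """List source sample indices, repeating each one per its replication."""
    if direction not in (1, -1):
        raise ValueError("direction must be +1 (training) or -1 (validation)")
    indices = []
    for i_sample, replication in enumerate(bag):
        if replication * direction > 0:
            indices.extend([i_sample] * abs(replication))
    return indices


# Hypothetical bag: sample 2 appears twice in training,
# sample 4 twice in validation, sample 3 is excluded (assumed meaning of 0)
bag = [1, -1, 2, 0, -2, 1]
train_indices = expand_indices(bag, 1)
validation_indices = expand_indices(bag, -1)
assert len(train_indices) == count_included(bag, 1)
assert len(validation_indices) == count_included(bag, -1)
```

Encoding both sets in one signed array lets the same shared dataset feed the training and validation layouts without copying the raw samples.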

Usage

This module is instantiated at the beginning of the boosting process (during BoosterCore::Create) and persists for the entire boosting session. It provides the data layout that all boosting operations (bin summing, gradient computation, split finding) operate on.

Code Reference

Source Location

Signature

ErrorEbm DataSetBoosting::InitDataSetBoosting(
    const bool bAllocateGradients,
    const bool bAllocateHessians,
    const bool bAllocateSampleScores,
    const bool bAllocateTargetData,
    const bool bAllocateCachedTensors,
    void* const rng,
    const size_t cScores,
    const size_t cSubsetItemsMax,
    const ObjectiveWrapper* const pObjectiveCpu,
    const ObjectiveWrapper* const pObjectiveSIMD,
    const unsigned char* const pDataSetShared,
    const double* const aIntercept,
    const BagEbm direction,
    const size_t cSharedSamples,
    const BagEbm* const aBag,
    const double* const aInitScores,
    const size_t cIncludedSamples,
    const size_t cInnerBags,
    const size_t cWeights,
    const size_t cTerms,
    const Term* const* const apTerms,
    const IntEbm* const aiTermFeatures);

void DataSetBoosting::DestructDataSetBoosting(
    const size_t cTerms,
    const size_t cInnerBags);

I/O Contract

Inputs

Name                Type                      Required  Description
bAllocateGradients  bool                      Yes       Whether to allocate gradient arrays
bAllocateHessians   bool                      Yes       Whether to allocate hessian arrays
cScores             size_t                    Yes       Number of score outputs (1 for regression/binary, N for multiclass)
cSubsetItemsMax     size_t                    Yes       Maximum samples per data subset (for 32-bit overflow prevention)
pObjectiveCpu       const ObjectiveWrapper*   Yes       CPU objective function wrapper
pObjectiveSIMD      const ObjectiveWrapper*   Yes       SIMD objective function wrapper
pDataSetShared      const unsigned char*      Yes       Shared dataset binary blob
aBag                const BagEbm*             No        Bag replication array (positive=train, negative=validation)
cInnerBags          size_t                    Yes       Number of inner bags for bagging
apTerms             const Term* const*        Yes       Array of term pointers defining interaction terms

Outputs

Name                     Type        Description
return value             ErrorEbm    Error code (Error_None on success)
DataSetBoosting members  (internal)  Initialized subsets with gradient, score, target, and term data arrays

Usage Examples

Pipeline Context

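# This C++ module is called internally via the native bindings
# when creating the boosting core during fit()
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification

# Small synthetic dataset so the example runs end to end
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)  # Internally creates DataSetBoosting for train/validation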

Related Pages
