Heuristic: interpretml/interpret EBM Hyperparameter Tuning Guide
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Optimization |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Comprehensive hyperparameter tuning guide for Explainable Boosting Machines, prioritized by impact on model accuracy with empirically validated recommendations.
Description
EBMs often perform well with default settings, but hyperparameter tuning can provide modest improvements. The InterpretML team has documented an empirically validated tuning priority list based on extensive benchmarking. The key insight is that parameters are not equally important for tuning: `max_leaves` and `smoothing_rounds` have the most impact, while `outer_bags` and `max_rounds` have diminishing returns beyond their defaults. Additionally, classification and regression tasks prefer different default values for several parameters, contradicting the intuition that one set of defaults fits all.
Usage
Use this guide when tuning an EBM model for maximum accuracy. Tune parameters in priority order (top to bottom), as higher-priority parameters have more impact. Stop tuning when marginal improvement becomes negligible for your use case. This guide is particularly valuable when:
- Setting up a hyperparameter search grid
- Deciding which parameters to include in a Bayesian optimization search
- Understanding why default values were chosen
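When setting up a search grid, the guide's priority ordering can be captured directly as an ordered mapping. A minimal sketch in plain Python (parameter names match the interpret API; the candidate lists are the ones recommended in the priority list below, and dict insertion order encodes tuning priority):

```python
# Ordered EBM search space, highest-impact parameter first.
# Iterating the dict yields parameters in tuning-priority order
# (dicts preserve insertion order in Python 3.7+).
EBM_SEARCH_SPACE = {
    "max_leaves": [2, 3],
    "smoothing_rounds": [0, 25, 50, 75, 100, 150, 200,
                         350, 500, 750, 1000, 1500, 2000],
    "learning_rate": [0.0025, 0.005, 0.01, 0.015, 0.02,
                      0.03, 0.04, 0.05, 0.1, 0.2],
    "interactions": ["3x", "3.5x", "5x", "25x", "50x"],  # "Nx" = N * n_features
    "inner_bags": [0, 20],  # never 1-19: worse models at higher cost
}
```

Parameters earlier in the dict deserve a finer search (and a larger share of your tuning budget) than later ones.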
The Insight (Rule of Thumb)
Parameter Priority (Most to Least Impact)
1. max_leaves (default: 2)
- Action: Try [2, 3]. Use 3 for datasets with categoricals or sharp transitions. Use 2 for smooth continuous features.
- Trade-off: max_leaves=3 increases model complexity and reduces interpretability.
2. smoothing_rounds (default: 75 classification, 500 regression)
- Action: Search [0, 25, 50, 75, 100, 150, 200, 350, 500, 750, 1000, 1500, 2000].
- Trade-off: Higher values improve accuracy but increase fitting time. Regression benefits from more smoothing than classification.
3. learning_rate (default: 0.015 classification, 0.04 regression)
- Action: Search [0.0025, 0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2].
- Trade-off: Counterintuitively, lower is NOT always better for EBMs. Regression prefers higher rates; binary classification prefers lower; multiclass is in between.
4. interactions (default: "3x" classification, "5x" regression)
- Action: Use "Nx" format (multiples of feature count). For classification: "3.5x" is good. For regression: "25x" to "50x" is optimal but reduces interpretability.
- Trade-off: More interactions improve accuracy but reduce interpretability. Start high, then prune unimportant ones.
5. inner_bags (default: 0)
- Action: Try [0, 20]. Do NOT use values between 1 and 19.
- Trade-off: Setting inner_bags to 20 increases fitting time by roughly 20x. Values from 1 to 19 add cost yet make the model worse.
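The "tune in priority order, then stop when gains become negligible" procedure amounts to a greedy one-parameter-at-a-time search. A sketch under stated assumptions: `score` is a placeholder for whatever objective you use (e.g. cross-validated AUC from fitting an EBM), not part of the interpret API.

```python
from typing import Any, Callable, Dict, List, Optional

def tune_in_priority_order(
    search_space: Dict[str, List[Any]],
    score: Callable[[Dict[str, Any]], float],
    base: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Greedy one-parameter-at-a-time search in dict (priority) order.

    For each parameter, every candidate is scored with all other
    parameters held at their current best; the winner is frozen before
    moving on. Higher score is better.
    """
    best = dict(base or {})
    for name, candidates in search_space.items():
        best[name] = max(candidates, key=lambda v: score({**best, name: v}))
    return best

# Hypothetical stand-in objective for illustration only; in practice,
# fit an EBM with `params` and return a cross-validated metric.
target = {"max_leaves": 3, "learning_rate": 0.04}
score = lambda params: sum(params.get(k) == v for k, v in target.items())
best = tune_in_priority_order(
    {"max_leaves": [2, 3], "learning_rate": [0.015, 0.04]}, score
)
```

Greedy search ignores parameter interactions, but since the guide's whole point is that impact is strongly ordered, it is a reasonable budget-conscious alternative to a full grid.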
Parameters with Diminishing Returns (Set and Forget)
- outer_bags = 14 (no benefit beyond this; use 8 on machines with fewer than 14 cores)
- max_rounds = 50000 (set to 1000000000 if fitting time is acceptable; early stopping handles the rest)
- max_bins = 1024 (no benefit beyond this for most datasets)
- max_interaction_bins = 64 (256 or higher gives marginal improvement at significant time cost)
- early_stopping_rounds = 100 (200 is slightly better but slower; beyond 200, no benefit)
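These set-and-forget values can live in a single base config that you splat into the estimator constructor (e.g. `ExplainableBoostingClassifier(**SET_AND_FORGET)`) and never revisit. A sketch, with the outer_bags/core-count rule applied automatically:

```python
import os

# Set-and-forget values from this guide. outer_bags drops to 8 on small
# machines: beyond the core count, extra bags mostly add wall-clock time.
cores = os.cpu_count() or 1
SET_AND_FORGET = {
    "outer_bags": 14 if cores >= 14 else 8,
    "max_rounds": 50_000,          # raise further only if fit time allows
    "max_bins": 1024,
    "max_interaction_bins": 64,    # 256+ helps marginally, costs a lot
    "early_stopping_rounds": 100,  # 200 is slightly better but slower
}
```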
Classification vs Regression Differences
- learning_rate: Classification prefers 0.015; regression prefers 0.04
- smoothing_rounds: Classification ~75; regression ~500+
- interactions: Classification "3x"-"3.5x"; regression "5x"-"25x"+
- early_stopping_tolerance: Default 1e-5; set to 0.0 or negative for slight improvement (EBM bagging compensates for overfitting)
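The task-specific starting points above fit in a small lookup table. The guide writes interaction budgets in "Nx" form (multiples of the feature count); if your interpret version does not accept that string directly, a resolver converts it to an absolute count. Both the helper and the table below are illustrative sketches, not part of the interpret API:

```python
def resolve_interactions(spec: str, n_features: int) -> int:
    """Convert an 'Nx' budget (multiples of the feature count) into an
    absolute number of interaction terms. Hypothetical helper."""
    if not spec.endswith("x"):
        raise ValueError(f"expected 'Nx' format, got {spec!r}")
    return round(float(spec[:-1]) * n_features)

# Recommended starting points per task, from this guide.
TASK_DEFAULTS = {
    "classification": {"learning_rate": 0.015,
                       "smoothing_rounds": 75,
                       "interactions": "3x"},
    "regression": {"learning_rate": 0.04,
                   "smoothing_rounds": 500,
                   "interactions": "5x"},
}
```

For example, with 20 features a regression budget of "5x" resolves to 100 interaction terms.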
Reasoning
EBMs are bagged ensemble models: the final model averages many outer bags, each trained on a different sample of the data. Individual bags can be allowed to overfit slightly, because ensemble averaging reduces the resulting variance. This explains why:
- `early_stopping_tolerance` of 0.0 or negative can improve accuracy (individual overfitting is compensated by averaging)
- `outer_bags` of 14 is sufficient (ensemble averaging converges)
- `inner_bags` of 20 is the threshold (sufficient bagging to compensate for using data subsets)
The classification vs regression parameter differences arise because classification objectives (log loss) have different loss surface geometry than regression objectives (RMSE). Classification benefits from more conservative learning rates and fewer smoothing rounds. Regression benefits from more aggressive learning rates and more smoothing.