Principle:Scikit learn Scikit learn Decision Tree Learning

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Supervised Learning, Interpretable Models
Last Updated	2026-02-08 15:00 GMT

Overview

Decision tree learning constructs a tree-structured model that makes predictions by recursively partitioning the feature space along axis-aligned splits, yielding an interpretable sequence of if-then rules.

Description

Decision trees learn a hierarchy of binary decisions, each based on a single feature threshold, to partition the input space into regions with homogeneous target values. They solve both classification and regression problems while producing models that are inherently interpretable and easy to visualize. Trees handle non-linear relationships and feature interactions without requiring feature scaling. CART can in principle split on both numerical and categorical features, although scikit-learn's implementation expects numerical input, so categorical features must be encoded first. However, individual trees are prone to overfitting and high variance, which motivates their use as base learners in ensemble methods (Random Forests, Gradient Boosting). Decision trees form one of the most fundamental and widely used model families in machine learning.

Usage

Use decision trees when interpretability is paramount and the audience needs to understand the reasoning behind predictions. Use them as a baseline model for classification or regression before trying more complex methods. Apply pre-pruning (max depth, min samples per leaf) or post-pruning to control overfitting. Decision trees are also the foundation for ensemble methods that aggregate many trees to improve predictive performance. Use tree visualization and export capabilities to communicate model logic to stakeholders.
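As a minimal sketch of this workflow, the example below fits a pre-pruned classifier and exports its rules as text. The choice of the Iris dataset and the specific values `max_depth=3` and `min_samples_leaf=5` are assumptions made for illustration, not recommendations from the source.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# Pre-pruning: cap the depth and require a minimum number of samples per leaf.
# Both values are illustrative and would normally be tuned.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)

# export_text renders the fitted tree as nested if-then rules.
rules = export_text(clf, feature_names=list(iris.feature_names))

print(f"test accuracy: {test_accuracy:.3f}")
print(rules)
```

The printed rules can be shared with stakeholders directly; `sklearn.tree.plot_tree` offers a graphical alternative when matplotlib is available.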

Theoretical Basis

Tree Construction (CART algorithm): The tree is built by recursively selecting the best split at each node:

For each candidate feature $j$ and threshold $t$, partition the data into left ($x_{j} \leq t$) and right ($x_{j} > t$) subsets.
Evaluate the quality of the split using an impurity criterion.
Select the split $(j^{*}, t^{*})$ that maximizes the reduction in impurity.
Repeat recursively on each subset until a stopping criterion is met.
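The split-selection steps above can be sketched in plain Python. This is an illustrative, unoptimized search that assumes Gini impurity as the criterion and uses midpoints between consecutive distinct feature values as candidate thresholds. The helper names `gini` and `best_split` are hypothetical and are not scikit-learn functions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(X, y):
    """Return (feature index, threshold, impurity reduction) of the best split."""
    n = len(y)
    parent_impurity = gini(y)
    best = (None, None, 0.0)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            reduction = parent_impurity - weighted
            if reduction > best[2]:
                best = (j, t, reduction)
    return best


# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
y = [0, 0, 1, 1]
feature, threshold, gain = best_split(X, y)
print(feature, threshold, gain)
```

A full tree builder would call `best_split` recursively on the left and right subsets until a stopping criterion (for example a depth limit or a pure node) is reached.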

Classification Impurity Measures:

Gini impurity: $G (m) = 1 - \sum_{k = 1}^{K} p_{m k}^{2}$

where $p_{m k}$ is the proportion of class $k$ in node $m$. Gini impurity is zero when all samples belong to one class.

Entropy (Information Gain): $H (m) = - \sum_{k = 1}^{K} p_{m k} \log_{2} (p_{m k})$

The information gain of a split is: $IG = H(\text{parent}) - \sum_{\text{child}} \frac{n_{\text{child}}}{n_{\text{parent}}} H(\text{child})$
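To make the classification measures concrete, the following sketch evaluates Gini impurity, entropy, and information gain on a small hand-made node. The function names and the toy labels are assumptions chosen for illustration.

```python
import math


def class_proportions(labels):
    """Proportion of each class present in the node."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return [c / n for c in counts.values()]


def gini(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    # Only classes that are present contribute, so log2(0) never occurs.
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)


parent = ["a"] * 4 + ["b"] * 4   # balanced node: maximally impure for two classes
left = ["a", "a", "a", "b"]
right = ["a", "b", "b", "b"]

parent_gini = gini(parent)
parent_entropy = entropy(parent)
ig = information_gain(parent, [left, right])
print(parent_gini, parent_entropy, ig)
```

For the balanced two-class parent, the Gini impurity is 0.5 and the entropy is 1 bit; the split shown recovers roughly 0.19 bits because both children remain mixed.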

Regression Impurity Measure:

Mean Squared Error: $\text{MSE}(m) = \frac{1}{|m|} \sum_{i \in m} (y_{i} - \bar{y}_{m})^{2}$

where $\bar{y}_{m}$ is the mean target value in node $m$ and $|m|$ is the number of samples in the node.
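A short worked example may help here: the sketch below computes the node MSE before and after a candidate split and reports the reduction. The target values and the split point are made up for illustration, and `node_mse` is a hypothetical helper.

```python
def node_mse(targets):
    """Mean squared deviation of the targets from the node mean."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
left, right = y[:3], y[3:]   # candidate split between the two clusters

parent_mse = node_mse(y)
weighted_child_mse = (
    len(left) * node_mse(left) + len(right) * node_mse(right)
) / len(y)
reduction = parent_mse - weighted_child_mse
print(parent_mse, weighted_child_mse, reduction)
```

Separating the two clusters removes almost all of the variance, which is why a regression tree would prefer this split over one that cuts through a cluster.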

Prediction:

Classification: Majority class in the leaf node, or the class probability distribution.
Regression: Mean of the target values in the leaf node (or the median when the tree is grown with an absolute-error criterion).
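The sketch below illustrates both prediction rules with scikit-learn on a six-sample toy dataset. Setting `min_samples_leaf=3` is an assumption made so that each tree has exactly one split and one leaf stays impure, which makes the leaf statistics easy to check by hand.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 1, 1, 1, 1])
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# With six samples and min_samples_leaf=3, the only admissible split is 3 | 3.
clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(min_samples_leaf=3, random_state=0).fit(X, y_reg)

# The left leaf holds labels [0, 0, 1]: class distribution (2/3, 1/3).
proba = clf.predict_proba([[0.5]])
label = clf.predict([[0.5]])[0]

# The regression leaves hold [1, 2, 3] and [10, 11, 12]: means 2 and 11.
left_pred = reg.predict([[0.5]])[0]
right_pred = reg.predict([[11.5]])[0]
print(proba, label, left_pred, right_pred)
```

`predict_proba` returns the class fractions of the leaf a sample falls into, and `predict` returns the class with the largest fraction.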

Pruning controls model complexity:

Pre-pruning: Limit max depth, require minimum samples per split/leaf, limit max leaf nodes.
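Cost-complexity pruning (post-pruning): Grow the full tree, then select the subtree $T$ that minimizes:

$R_{\alpha}(T) = R(T) + \alpha |T|$ where $R(T)$ is the training error of the tree (scikit-learn uses the total sample-weighted impurity of the leaves), $|T|$ is the number of leaves, and $\alpha \geq 0$ is the complexity parameter. Larger values of $\alpha$ prune more aggressively; $\alpha$ is typically chosen via cross-validation.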
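One possible way to choose the complexity parameter with scikit-learn is sketched below: `cost_complexity_pruning_path` lists the effective alphas of the full tree, and each candidate is scored with 5-fold cross-validation. The breast cancer dataset, the fold count, and the clipping of tiny negative alphas (a guard against floating-point noise) are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which successive subtrees of the full tree are pruned.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Drop the last alpha (it prunes down to the root) and guard against
# tiny negative values caused by floating-point round-off.
candidate_alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

scores = []
for alpha in candidate_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores.append(cross_val_score(tree, X, y, cv=5).mean())

best_alpha = candidate_alphas[int(np.argmax(scores))]

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, unpruned.get_n_leaves(), pruned.get_n_leaves())
```

In practice the alphas would be derived from a training split only, with a held-out test set reserved for the final evaluation.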

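Feature importance is computed as the total reduction in impurity brought by each feature across all nodes that split on it: $\text{importance}(j) = \sum_{m \,:\, \text{split on } j} n_{m} \cdot \Delta \text{impurity}(m)$ where $n_{m}$ is the number of samples reaching node $m$. scikit-learn normalizes these values so that they sum to 1.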
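As a rough check of this formula, the sketch below recomputes impurity-based importances from the low-level `tree_` arrays of a fitted classifier and compares them with `feature_importances_`. The dataset and `max_depth=3` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

t = clf.tree_
raw = np.zeros(iris.data.shape[1])
for m in range(t.node_count):
    left, right = t.children_left[m], t.children_right[m]
    if left == -1:
        continue  # leaf nodes do not split on any feature
    # n_m * delta_impurity(m), written as
    # n_m * imp(m) - n_left * imp(left) - n_right * imp(right)
    decrease = (
        t.weighted_n_node_samples[m] * t.impurity[m]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    raw[t.feature[m]] += decrease

manual = raw / raw.sum()   # normalize so the importances sum to 1
for name, value in zip(iris.feature_names, manual):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances are computed on the training data and can favor features with many distinct values, so they are best read as a descriptive summary of the fitted tree.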
