Principle:Scikit learn Scikit learn Missing Data Imputation

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Data Preprocessing, Statistical Inference
Last Updated	2026-02-08 15:00 GMT

Overview

Missing data imputation fills in absent values in a dataset using statistical or machine learning methods, enabling downstream analyses that require complete data.

Description

Real-world datasets frequently contain missing values due to data collection errors, sensor failures, survey non-response, or data integration issues. Most machine learning algorithms cannot handle missing values directly, which makes imputation a necessary preprocessing step. Imputation methods range from simple strategies (filling with the mean or median) to multivariate approaches that model the relationships between features to predict missing values. The choice of method can significantly affect model performance: simple strategies can bias estimates and shrink feature variance, whereas multivariate methods tend to better preserve the statistical properties of the data.

Usage

Use simple imputation (mean, median, most frequent, or constant value) as a quick baseline, or when the fraction of missing data is small and the values are plausibly missing completely at random (MCAR). Use KNN imputation when similar observations are expected to have similar feature values, so that local structure in the data can be exploited. Use iterative imputation (analogous to the MICE algorithm) when features are correlated and a multivariate model can use inter-feature relationships to produce more accurate estimates. Iterative imputation is the most flexible of the three approaches, but it is typically also the most computationally expensive.
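The three options can be compared side by side with scikit-learn's `SimpleImputer`, `KNNImputer`, and `IterativeImputer`. The following is a minimal sketch: the toy array and the parameter values (`n_neighbors=2`, `max_iter=10`) are illustrative assumptions, not recommendations. Note that `IterativeImputer` is still marked experimental and needs the explicit `enable_iterative_imputer` import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Toy data (illustrative values only): two correlated columns with gaps.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

# One imputer per strategy; parameter values are assumptions for the demo.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# fit_transform learns the fill values (or models) and returns a complete array.
results = {name: imp.fit_transform(X) for name, imp in imputers.items()}

for name, X_filled in results.items():
    print(name)
    print(X_filled)
```

With the mean strategy, the gaps are filled with the column means of the observed entries (2.75 for the first column and 6.5 for the second). The KNN and iterative results depend on the neighbors and the fitted regression model, so they will generally differ from the column means.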

Theoretical Basis

Missing Data Mechanisms:

MCAR (Missing Completely at Random): The probability of missingness does not depend on any observed or unobserved data.
MAR (Missing at Random): The probability of missingness depends only on observed data, not on the missing values themselves.
MNAR (Missing Not at Random): The probability of missingness depends on the missing values themselves, even after accounting for the observed data.
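The three mechanisms can be illustrated with a small simulation. This is a sketch under stated assumptions: the two-feature Gaussian data, the 30% MCAR rate, and the logistic missingness model with its coefficients are all hypothetical choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two correlated features: x1 is always observed, x2 may go missing.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)


def sigmoid(z):
    """Logistic function, used here as an assumed missingness model."""
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every x2 value has the same 30% chance of being missing.
mcar = rng.random(n) < 0.3
# MAR: missingness of x2 depends only on the always-observed x1.
mar = rng.random(n) < sigmoid(2.0 * x1 - 1.0)
# MNAR: missingness of x2 depends on the value of x2 itself.
mnar = rng.random(n) < sigmoid(2.0 * x2 - 1.0)

true_mean = x2.mean()
observed_means = {
    "MCAR": x2[~mcar].mean(),
    "MAR": x2[~mar].mean(),
    "MNAR": x2[~mnar].mean(),
}

print(f"true mean of x2: {true_mean:+.3f}")
for name, m in observed_means.items():
    print(f"{name}: mean of observed x2 = {m:+.3f}")
```

Under MCAR the mean of the observed values stays close to the true mean. Under MAR and MNAR the observed mean is pulled downward, because large values are more likely to be missing. The difference is that under MAR the bias can in principle be corrected by modeling x2 given x1, whereas under MNAR the observed data alone do not contain the information needed to correct it.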

Simple Imputation replaces missing values with a single statistic computed from the observed values:

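Mean: ${\hat{x}}_{i j} = {\bar{x}}_{j} = \frac{1}{n_{j}^{\text{obs}}} \sum_{l : x_{l j} \text{ observed}} x_{l j}$, where $n_{j}^{\text{obs}}$ is the number of observed values of feature $j$
Median: ${\hat{x}}_{i j} = \operatorname{median} (\{x_{l j} : x_{l j} \text{ observed}\})$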
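As a worked example of the two formulas, the column statistics can be computed over the observed entries with NumPy's NaN-aware functions and then substituted into the gaps. This is a minimal sketch on a made-up array, not the scikit-learn implementation.

```python
import numpy as np

# Toy data (illustrative): NaN marks a missing entry.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [9.0, 50.0],
])

# Per-column statistics over observed entries only.
col_mean = np.nanmean(X, axis=0)      # [4.0, 30.0]
col_median = np.nanmedian(X, axis=0)  # [2.0, 30.0]

# Replace each missing entry with the statistic of its column.
missing = np.isnan(X)
X_mean = np.where(missing, col_mean, X)
X_median = np.where(missing, col_median, X)

print(X_mean)
print(X_median)
```

The first column shows why the choice of statistic matters: the observed values 1, 2, and 9 have mean 4 but median 2, so the median is less affected by the outlying 9.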

KNN Imputation estimates missing values from the $k$ nearest neighbors that have observed values for the feature:

${\hat{x}}_{i j} = \frac{\sum_{l \in N_{k} (i)} w_{l} \cdot x_{l j}}{\sum_{l \in N_{k} (i)} w_{l}}$

where $N_{k} (i)$ is the set of $k$ nearest neighbors of sample $i$ (using only features observed in both) and $w_{l}$ are distance-based weights.
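The weighted average above can be sketched directly in NumPy for a single missing entry. The helper `knn_impute_entry` is hypothetical and deliberately simplified: it uses plain Euclidean distance over the features observed in both rows and inverse-distance weights, which is one possible choice of $w_{l}$. scikit-learn's `KNNImputer` uses a NaN-aware Euclidean distance that additionally rescales for the number of shared coordinates, so its results can differ.

```python
import numpy as np


def knn_impute_entry(X, i, j, k=2, weighted=True):
    """Estimate X[i, j] from the k nearest rows that have feature j observed."""
    candidates = []
    for l in range(X.shape[0]):
        if l == i or np.isnan(X[l, j]):
            continue  # donors must have feature j observed
        # Compare only on features observed in both rows.
        shared = ~np.isnan(X[i]) & ~np.isnan(X[l])
        if not shared.any():
            continue  # no common features, distance is undefined
        dist = np.sqrt(np.sum((X[i, shared] - X[l, shared]) ** 2))
        candidates.append((dist, X[l, j]))
    if not candidates:
        raise ValueError("no usable neighbors for this entry")

    candidates.sort(key=lambda t: t[0])
    dists = np.array([d for d, _ in candidates[:k]])
    values = np.array([v for _, v in candidates[:k]])

    # w_l = 1 / d_l (distance-based) or w_l = 1 (uniform);
    # the small epsilon guards against division by zero.
    w = 1.0 / (dists + 1e-12) if weighted else np.ones_like(dists)
    return float(np.sum(w * values) / np.sum(w))


# Toy data (illustrative): the last feature of row 0 is missing.
X = np.array([
    [1.0, 1.0, np.nan],
    [1.0, 2.0, 10.0],
    [1.0, 4.0, 40.0],
    [9.0, 9.0, 90.0],
])

uniform_estimate = knn_impute_entry(X, i=0, j=2, k=2, weighted=False)
weighted_estimate = knn_impute_entry(X, i=0, j=2, k=2, weighted=True)
print(uniform_estimate, weighted_estimate)
```

Here the two nearest donors are rows 1 and 2, at distances 1 and 3. Uniform weights give (10 + 40) / 2 = 25, while inverse-distance weights give (10 + 40/3) / (1 + 1/3) = 17.5, pulling the estimate toward the closer neighbor.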

Iterative Imputation (MICE - Multiple Imputation by Chained Equations):

1. Initialize missing values (e.g., with the mean).
2. For each feature $j$ with missing values:
   a. Fit a regression model predicting feature $j$ from all other features, using only rows where $j$ is observed.
   b. Use the fitted model to predict (impute) the missing values of feature $j$.
3. Repeat step 2 for a fixed number of rounds or until convergence.
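The procedure above can be sketched in a few lines of NumPy. This is an illustrative simplification, not scikit-learn's `IterativeImputer`: the hypothetical helper `iterative_impute` uses ordinary least squares as the regression model, visits features in column order, and runs a fixed number of rounds without a convergence check.

```python
import numpy as np


def iterative_impute(X, n_rounds=10):
    """Chained-equations sketch using ordinary least squares per feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initialization: fill every gap with the mean of its column.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):        # repeat for a fixed number of rounds
        for j in range(X.shape[1]):  # visit each feature that has gaps
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            # Design matrix: intercept plus all other (currently filled) features.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])
            # Fit on rows where feature j is observed.
            coef, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            # Predict (impute) the missing entries of feature j.
            filled[~obs, j] = A[~obs] @ coef
    return filled


# Toy data (illustrative): the observed rows follow x2 = 2 * x1 + 1 exactly.
X = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, np.nan],
    [np.nan, 9.0],
    [6.0, 13.0],
])

X_filled = iterative_impute(X, n_rounds=10)
print(X_filled)
```

Because the observed rows lie on an exact line, the imputed values should move from the initial column means toward the values implied by that line (about 7 for the missing second-column entry and about 4 for the missing first-column entry) as the rounds proceed. On real data the features are only approximately related, so the imputations carry regression error.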

The iterative procedure models the conditional distribution of each feature given all others:

$p (x_{j}^{\text{miss}} \mid x_{- j})$

This approach can use any regression estimator as the underlying model (e.g., Bayesian Ridge, Random Forest, KNN).
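In scikit-learn, the underlying model is passed through the `estimator` parameter of `IterativeImputer`. The sketch below swaps in three regressors on synthetic data and compares their reconstruction error. The data-generating process, the 10% missing rate, and the hyperparameters (`n_estimators=20`, `n_neighbors=5`, `max_iter=5`) are assumptions chosen to keep the example small; the resulting ranking should not be read as general.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumed): the third feature depends nonlinearly on the others.
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = X_full[:, 0] + X_full[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# Remove roughly 10% of the entries completely at random.
mask = rng.random(X_full.shape) < 0.1
X = X_full.copy()
X[mask] = np.nan

estimators = {
    "bayesian_ridge": BayesianRidge(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

rmse = {}
for name, est in estimators.items():
    imputer = IterativeImputer(estimator=est, max_iter=5, random_state=0)
    X_filled = imputer.fit_transform(X)
    # Error on the entries that were artificially removed.
    rmse[name] = float(np.sqrt(np.mean((X_filled[mask] - X_full[mask]) ** 2)))

for name, err in rmse.items():
    print(f"{name}: RMSE on removed entries = {err:.3f}")
```

With a small `max_iter`, scikit-learn may emit a convergence warning; that is expected in this sketch and can be addressed by raising `max_iter` or relaxing `tol`.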
