Principle:Scikit learn Scikit learn Missing Data Imputation

From Leeroopedia


Knowledge Sources
Domains: Data Preprocessing, Statistical Inference
Last Updated: 2026-02-08 15:00 GMT

Overview

Missing data imputation fills in absent values in a dataset using statistical or machine learning methods, enabling downstream analyses that require complete data.

Description

Real-world datasets frequently contain missing values due to data collection errors, sensor failures, survey non-response, or data integration issues. Most machine learning algorithms cannot handle missing values directly, making imputation a necessary preprocessing step. Imputation methods range from simple strategies (filling with the mean or median) to multivariate approaches that model the relationships between features to predict missing values. The choice of imputation method can significantly affect model performance: naive approaches such as mean imputation shrink the variance of the imputed feature and weaken its correlations with other features, whereas multivariate methods better preserve the joint statistical properties of the data.

Usage

Use simple imputation (mean, median, most frequent, or constant value) as a quick baseline, or when the fraction of missing data is small and the values are plausibly missing completely at random (MCAR). Use KNN imputation when similar observations are expected to have similar feature values, so that local structure in the data can be exploited. Use iterative imputation (analogous to the MICE algorithm) when features are correlated and a multivariate approach can leverage inter-feature relationships to produce more accurate estimates. Iterative imputation is the most flexible of the three, since any regression estimator can serve as the underlying model, but it is also the most computationally expensive because it refits one model per incomplete feature in every round.
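As a rough illustration of the three options, the sketch below applies scikit-learn's `SimpleImputer`, `KNNImputer`, and `IterativeImputer` to the same small array. The toy data and the parameter values (`n_neighbors=2`, `max_iter=10`) are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Toy data (made up for illustration): column 1 is exactly 2x column 0.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

imputers = {
    "simple (mean)": SimpleImputer(strategy="mean"),
    "knn (k=2)": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# Each imputer learns its fill rule from X and returns a completed copy.
results = {name: imp.fit_transform(X) for name, imp in imputers.items()}

for name, X_filled in results.items():
    print(name)
    print(X_filled)
```

Mean imputation ignores the relationship between the two columns, while the KNN and iterative variants can use it, which is the trade-off described above.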

Theoretical Basis

Missing Data Mechanisms:

  • MCAR (Missing Completely at Random): The probability of missingness does not depend on any observed or unobserved data.
  • MAR (Missing at Random): The probability of missingness depends on observed data but not on the missing values themselves.
  • MNAR (Missing Not at Random): The probability of missingness depends on the missing values themselves, even after accounting for the observed data.
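The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables `age` and `income`, their distributions, and the missingness probabilities are all hypothetical choices, and the two variables are generated independently.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(40, 10, size=n)      # always observed (hypothetical variable)
income = rng.normal(50, 15, size=n)   # the variable that will lose values

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed variable `age`.
mar_mask = rng.random(n) < np.where(age > 40, 0.35, 0.05)

# MNAR: missingness depends on the income value that goes missing.
mnar_mask = rng.random(n) < np.where(income > 50, 0.35, 0.05)

print(f"true mean income: {income.mean():.2f}")
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: fraction missing={mask.mean():.2f}, "
          f"mean of observed income={observed_mean:.2f}")
```

In this setup the mean of the observed incomes stays close to the true mean under MCAR, and also under MAR because `age` and `income` are independent here. Under MNAR, high incomes are dropped more often, so the observed mean is biased downward and no function of the observed data alone can reveal that.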

Simple Imputation replaces missing values with a single statistic computed from the observed values:

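  • Mean: $\hat{x}_{ij} = \bar{x}_j = \frac{1}{n_{\text{obs}}} \sum_{i:\, x_{ij}\ \text{observed}} x_{ij}$
  • Median: $\hat{x}_{ij} = \operatorname{median}(\{x_{ij} : x_{ij}\ \text{observed}\})$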
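The two statistics above can be sketched in a few lines of NumPy. The helper name `simple_impute` is hypothetical and the data is made up; in scikit-learn the corresponding tool is `SimpleImputer(strategy="mean")` or `SimpleImputer(strategy="median")`.

```python
import numpy as np

# Toy data (made up): NaN marks a missing entry.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [9.0, 50.0],
])

def simple_impute(X, statistic=np.nanmean):
    """Fill NaNs in each column with a statistic of that column's observed values."""
    X_filled = X.copy()
    fill_values = statistic(X, axis=0)     # per-column statistic, ignoring NaNs
    rows, cols = np.where(np.isnan(X))     # positions of the missing entries
    X_filled[rows, cols] = fill_values[cols]
    return X_filled

X_mean = simple_impute(X, np.nanmean)
X_median = simple_impute(X, np.nanmedian)

print(X_mean)
print(X_median)
```

Column 0 has observed values 1, 2, 9, so the mean fill is 4 while the median fill is 2, which shows how the median is less sensitive to the outlying value.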

KNN Imputation estimates missing values from the k nearest neighbors that have observed values for the feature:

$$\hat{x}_{ij} = \frac{\sum_{l \in N_k(i)} w_l \, x_{lj}}{\sum_{l \in N_k(i)} w_l}$$

where $N_k(i)$ is the set of k nearest neighbors of sample i (using only features observed in both) and $w_l$ are distance-based weights.
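A minimal NumPy sketch of this weighted average follows. It is not scikit-learn's implementation: the helper name `knn_impute`, the root-mean-square distance over shared features, and the inverse-distance weights are assumptions made for illustration. In scikit-learn, the comparable tool is `KNNImputer(n_neighbors=k, weights="distance")`, which uses a NaN-aware Euclidean distance.

```python
import numpy as np

def knn_impute(X, k=2, eps=1e-12):
    """Weighted k-NN imputation sketch (hypothetical helper, not the sklearn code)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    X_filled = X.copy()

    for i, j in zip(*np.where(~observed)):
        candidates = []  # (distance, donor value) pairs
        for l in range(X.shape[0]):
            if not observed[l, j]:
                continue  # a donor must have feature j observed
            shared = observed[i] & observed[l]
            if not shared.any():
                continue  # no features in common: distance undefined
            diff = X[i, shared] - X[l, shared]
            dist = np.sqrt(np.mean(diff ** 2))  # RMS distance over shared features
            candidates.append((dist, X[l, j]))
        if not candidates:
            continue  # leave the entry as NaN if no usable neighbor exists
        candidates.sort(key=lambda pair: pair[0])
        dists, values = map(np.array, zip(*candidates[:k]))
        weights = 1.0 / (dists + eps)  # w_l: inverse-distance weights
        X_filled[i, j] = np.sum(weights * values) / np.sum(weights)
    return X_filled

# Toy data (made up): row 1 is missing its third feature.
X = np.array([
    [0.0, 0.0, 10.0],
    [1.0, 1.0, np.nan],
    [3.0, 3.0, 40.0],
    [10.0, 10.0, 100.0],
])
X_knn = knn_impute(X, k=2)
print(X_knn[1, 2])
```

Here the two nearest donors are rows 0 and 2 at distances 1 and 2, so the estimate is (1·10 + 0.5·40) / 1.5 = 20 up to the small `eps` term.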

Iterative Imputation (MICE - Multiple Imputation by Chained Equations):

  1. Initialize missing values (e.g., with the mean).
  2. For each feature j with missing values:
    1. Fit a regression model predicting feature j from all other features, using only rows where j is observed.
    2. Use the fitted model to predict (impute) the missing values of feature j.
  3. Repeat step 2 for a fixed number of rounds or until convergence.
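The steps above can be sketched with ordinary least squares as the per-feature regression model. This is a simplified illustration under stated assumptions: the helper name `iterative_impute`, the linear model, and the stopping tolerance are hypothetical choices, and scikit-learn's `IterativeImputer` differs in its details.

```python
import numpy as np

def iterative_impute(X, n_rounds=10, tol=1e-6):
    """Chained-equations sketch using least squares (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialise missing entries with the column means.
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):                        # Step 3: repeat
        X_prev = X_filled.copy()
        for j in np.where(missing.any(axis=0))[0]:   # Step 2: each incomplete feature
            obs_rows = ~missing[:, j]
            others = np.delete(X_filled, j, axis=1)  # all other features (current fills)
            A = np.column_stack([np.ones(len(X)), others])  # add an intercept column
            # Step 2.1: fit on rows where feature j is observed.
            coef, *_ = np.linalg.lstsq(A[obs_rows], X_filled[obs_rows, j], rcond=None)
            # Step 2.2: predict the missing entries of feature j.
            X_filled[~obs_rows, j] = A[~obs_rows] @ coef
        if np.max(np.abs(X_filled - X_prev)) < tol:  # stop once the fills settle
            break
    return X_filled

# Toy data (made up): column 1 equals 2 * column 0 + 1 wherever both are observed.
X = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, np.nan],
    [4.0, 9.0],
    [np.nan, 11.0],
    [6.0, 13.0],
])
X_iter = iterative_impute(X)
print(X_iter)
```

Because the toy relationship is exactly linear, the fills move from the initial column means toward the values implied by the line (about 7 for row 2 and about 5 for row 4) within a few rounds.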

The iterative procedure models the conditional distribution of each feature given all others:

$$p(x_j^{\text{miss}} \mid x_{-j})$$

where $x_{-j}$ denotes all features other than j.

This approach can use any regression estimator as the underlying model (e.g., Bayesian Ridge, Random Forest, KNN).
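As a sketch of how the underlying model can be swapped, the example below passes three different regressors to scikit-learn's `IterativeImputer` through its `estimator` parameter. The synthetic data, the 10% missingness rate, and the hyperparameters are assumptions for illustration; the printed errors are not a benchmark.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, correlated features (made up for illustration).
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=n)
x2 = x0 - x1 + rng.normal(scale=0.1, size=n)
X_true = np.column_stack([x0, x1, x2])

# Hide roughly 10% of the entries completely at random.
mask = rng.random(X_true.shape) < 0.1
X = X_true.copy()
X[mask] = np.nan

estimators = {
    "bayesian_ridge": BayesianRidge(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

errors = {}
for name, est in estimators.items():
    imputer = IterativeImputer(estimator=est, max_iter=5, random_state=0)
    X_filled = imputer.fit_transform(X)
    # RMSE on the entries that were hidden, against the known true values.
    errors[name] = float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))
    print(f"{name}: RMSE on hidden entries = {errors[name]:.3f}")
```

Which estimator works best depends on the data: a linear model suits linear relationships like the ones simulated here, while tree-based or neighbor-based models may suit nonlinear ones at a higher computational cost.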
