Principle: Scikit-learn Anomaly Detection
| Knowledge Sources | |
|---|---|
| Domains | Unsupervised Learning, Outlier Detection |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Anomaly detection identifies observations that deviate significantly from the majority of the data, flagging them as outliers or novelties.
Description
Anomaly detection methods learn what "normal" data looks like and then identify instances that do not conform to this learned pattern. The distinction between outlier detection (the training data contains outliers) and novelty detection (the training data is clean and we detect new anomalies at test time) defines two important sub-problems. These techniques solve the problem of finding unusual patterns in data without requiring labeled examples of anomalies, which are typically rare and difficult to collect. Anomaly detection is critical in fraud detection, network intrusion detection, manufacturing quality control, and medical diagnostics.
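The two sub-problems map onto different API usage in scikit-learn. A minimal sketch of the distinction (the synthetic data and the planted outlier at (6, 6) are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_clean = rng.normal(0, 1, size=(200, 2))      # inliers only
X_mixed = np.vstack([X_clean, [[6.0, 6.0]]])   # training data contains an outlier

# Outlier detection: fit on contaminated data, get labels for the training set
labels = IsolationForest(random_state=42).fit_predict(X_mixed)
print(labels[-1])  # -1 marks the planted outlier; inliers are labeled 1

# Novelty detection: fit on clean data, then score previously unseen points
lof = LocalOutlierFactor(novelty=True).fit(X_clean)
print(lof.predict([[0.0, 0.0], [6.0, 6.0]]))
```

Note that LOF must be constructed with `novelty=True` to expose `predict` for new data; in its default outlier-detection mode it only scores the training set.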
Usage
Use Isolation Forest when the dataset is high-dimensional and anomalies are expected to be isolated from normal data; it is fast and scales well. Use Local Outlier Factor (LOF) when the data has clusters of varying density and local context is important for determining what constitutes an anomaly. Use Elliptic Envelope when the data is approximately Gaussian and you want to detect outliers based on a robust covariance estimate. Isolation Forest and Elliptic Envelope support both outlier detection and novelty detection modes, while LOF is primarily designed for outlier detection (though it also offers a novelty detection mode).
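All three estimators share the `fit_predict` interface for outlier detection, which makes them easy to compare on the same data. A sketch on toy data (the `contamination=0.02` value matches the fraction of planted outliers here and is not a recommended default):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# 100 Gaussian inliers plus two planted extreme points (indices 100 and 101)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0], [-8.0, 7.5]]])

estimators = {
    "IsolationForest": IsolationForest(contamination=0.02, random_state=0),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.02),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.02, random_state=0),
}
for name, est in estimators.items():
    labels = est.fit_predict(X)  # 1 = inlier, -1 = outlier
    print(name, "flags indices:", np.where(labels == -1)[0])
```

The `contamination` parameter sets the expected fraction of outliers and determines where each estimator places its decision threshold.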
Theoretical Basis
Isolation Forest is based on the principle that anomalies are easier to isolate than normal points. The algorithm:
- Constructs an ensemble of isolation trees, where each tree recursively partitions the data by randomly selecting a feature and a random split value between that feature's minimum and maximum.
- Measures how easy a point $x$ is to isolate by the path length $h(x)$ from the root to the node that isolates it.
- Assigns each point the anomaly score:

$$s(x, n) = 2^{-E[h(x)] / c(n)}$$

where $E[h(x)]$ is the average path length across trees and $c(n)$ is the average path length of an unsuccessful search in a binary search tree of $n$ elements (a normalization factor). Scores close to 1 indicate anomalies; scores close to 0.5 indicate normal points.
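One practical wrinkle worth a sketch: scikit-learn's `score_samples` returns the *opposite* of the paper's $s(x, n)$, so anomalies score near $-1$ and normal points near $-0.5$ (synthetic data below is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(0, 1, size=(256, 2))  # synthetic "normal" data

iso = IsolationForest(random_state=0).fit(X)

# score_samples is the negated anomaly score s(x, n):
# values near -1 indicate anomalies, values near -0.5 normal points.
scores = iso.score_samples(np.array([[0.0, 0.0], [8.0, 8.0]]))
print(scores)  # the far-away point gets the lower (more negative) score
```

`decision_function` shifts these scores by the fitted `offset_`, so negative values there correspond to predicted outliers.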
Local Outlier Factor (LOF) measures the local density deviation of a point relative to its neighbors:
- Compute the reachability distance of $x$ from $y$: $\text{reach-dist}_k(x, y) = \max(k\text{-dist}(y),\, d(x, y))$, where $k\text{-dist}(y)$ is the distance from $y$ to its $k$-th nearest neighbor.
- Compute the local reachability density:

$$\text{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \text{reach-dist}_k(x, y) \right)^{-1}$$

- The LOF score is:

$$\text{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\text{lrd}_k(y)}{\text{lrd}_k(x)}$$
LOF values significantly greater than 1 indicate that the point is in a region of lower density than its neighbors, marking it as a local outlier.
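The "local" part is what distinguishes LOF: the same offset can be anomalous next to a dense cluster yet unremarkable inside a sparse one. A sketch under assumed cluster parameters (two synthetic clusters of very different density):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
dense = rng.normal(0, 0.1, size=(100, 2))     # tight cluster around the origin
sparse = rng.normal(10, 2.0, size=(100, 2))   # diffuse cluster around (10, 10)
X = np.vstack([dense, sparse, [[0.0, 1.0]]])  # last point sits just outside the dense cluster

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)

# negative_outlier_factor_ holds -LOF; values far below -1 indicate outliers.
print(lof.negative_outlier_factor_[-1])
```

An offset of 1.0 inside the sparse cluster would be well within its typical neighbor distances; the point is flagged only because its neighborhood happens to be dense.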
Elliptic Envelope assumes the data follows a multivariate Gaussian distribution and fits a robust covariance estimate using the Minimum Covariance Determinant (MCD) estimator. The Mahalanobis distance under the robust estimates $(\hat{\mu}, \hat{\Sigma})$ is then used to identify outliers:

$$d_M(x) = \sqrt{(x - \hat{\mu})^\top \hat{\Sigma}^{-1} (x - \hat{\mu})}$$

Points whose Mahalanobis distance exceeds a threshold (based on the $\chi^2$ distribution with degrees of freedom equal to the number of features) are flagged as outliers.
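A sketch of the distance-plus-threshold view (the 0.975 quantile is an illustrative cutoff; `EllipticEnvelope.predict` itself thresholds via the `contamination` parameter instead):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# 500 samples from a correlated 2-D Gaussian
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)

ee = EllipticEnvelope(random_state=0).fit(X)

# mahalanobis() returns *squared* Mahalanobis distances under the MCD estimate
d2 = ee.mahalanobis(np.array([[0.0, 0.0], [5.0, -5.0]]))

# Compare against a chi-squared cutoff with df = number of features
threshold = chi2.ppf(0.975, df=2)
print(d2 > threshold)
```

Because the squared Mahalanobis distance of Gaussian data follows a $\chi^2_p$ distribution, the quantile cutoff has a direct probabilistic interpretation, which is harder to obtain for the tree- and density-based methods above.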