Principle: Scikit-learn Anomaly Detection
| Knowledge Sources | |
|---|---|
| Domains | Unsupervised Learning, Outlier Detection |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Anomaly detection identifies observations that deviate significantly from the majority of the data, flagging them as outliers or novelties.
Description
Anomaly detection methods learn what "normal" data looks like and then identify instances that do not conform to this learned pattern. The distinction between outlier detection (the training data contains outliers) and novelty detection (the training data is clean and we detect new anomalies at test time) defines two important sub-problems. These techniques solve the problem of finding unusual patterns in data without requiring labeled examples of anomalies, which are typically rare and difficult to collect. Anomaly detection is critical in fraud detection, network intrusion detection, manufacturing quality control, and medical diagnostics.
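The two sub-problems map onto different API usage in scikit-learn. A minimal sketch of the distinction (the synthetic data and the planted outlier at (6, 6) are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_clean = rng.normal(0, 1, size=(200, 2))      # inliers only
X_mixed = np.vstack([X_clean, [[6.0, 6.0]]])   # training data contains an outlier

# Outlier detection: fit on contaminated data, get labels for the training set
labels = IsolationForest(random_state=42).fit_predict(X_mixed)
print(labels[-1])  # -1 marks the planted outlier; inliers are labeled 1

# Novelty detection: fit on clean data, then score previously unseen points
lof = LocalOutlierFactor(novelty=True).fit(X_clean)
print(lof.predict([[0.0, 0.0], [6.0, 6.0]]))
```

Note that LOF must be constructed with `novelty=True` to expose `predict` for new data; in its default outlier-detection mode it only scores the training set.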
Usage
Use Isolation Forest when the dataset is high-dimensional and anomalies are expected to be isolated from normal data; it is fast and scales well. Use Local Outlier Factor (LOF) when the data has clusters of varying density and local context is important for determining what constitutes an anomaly. Use Elliptic Envelope when the data is approximately Gaussian and you want to detect outliers based on a robust covariance estimate. Isolation Forest and Elliptic Envelope support both outlier detection and novelty detection modes, while LOF is primarily designed for outlier detection (though it also offers a novelty detection mode).
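All three estimators share the `fit_predict` interface for outlier detection, which makes them easy to compare on the same data. A sketch on toy data (the `contamination=0.02` value matches the fraction of planted outliers here and is not a recommended default):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# 100 Gaussian inliers plus two planted extreme points (indices 100 and 101)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0], [-8.0, 7.5]]])

estimators = {
    "IsolationForest": IsolationForest(contamination=0.02, random_state=0),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20, contamination=0.02),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.02, random_state=0),
}
for name, est in estimators.items():
    labels = est.fit_predict(X)  # 1 = inlier, -1 = outlier
    print(name, "flags indices:", np.where(labels == -1)[0])
```

The `contamination` parameter sets the expected fraction of outliers and determines where each estimator places its decision threshold.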
Theoretical Basis
Isolation Forest is based on the principle that anomalies are easier to isolate than normal points. The algorithm:
- Constructs an ensemble of isolation trees, where each tree recursively partitions the data by randomly selecting a feature and a random split value between that feature's minimum and maximum.
- Measures how easy a point $x$ is to isolate by the path length $h(x)$ from the root to the node that isolates it.
- Assigns each point the anomaly score:

$$s(x, n) = 2^{-E[h(x)] / c(n)}$$

where $E[h(x)]$ is the average path length across trees and $c(n)$ is the average path length of an unsuccessful search in a binary search tree of $n$ elements (a normalization factor). Scores close to 1 indicate anomalies; scores close to 0.5 indicate normal points.
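One practical wrinkle worth a sketch: scikit-learn's `score_samples` returns the *opposite* of the paper's $s(x, n)$, so anomalies score near $-1$ and normal points near $-0.5$ (synthetic data below is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(0, 1, size=(256, 2))  # synthetic "normal" data

iso = IsolationForest(random_state=0).fit(X)

# score_samples is the negated anomaly score s(x, n):
# values near -1 indicate anomalies, values near -0.5 normal points.
scores = iso.score_samples(np.array([[0.0, 0.0], [8.0, 8.0]]))
print(scores)  # the far-away point gets the lower (more negative) score
```

`decision_function` shifts these scores by the fitted `offset_`, so negative values there correspond to predicted outliers.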
Local Outlier Factor (LOF) measures the local density deviation of a point relative to its neighbors:
- Compute the reachability distance of $x$ from $y$: $\text{reach-dist}_k(x, y) = \max(k\text{-dist}(y),\, d(x, y))$, where $k\text{-dist}(y)$ is the distance from $y$ to its $k$-th nearest neighbor.
- Compute the local reachability density:

$$\text{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \text{reach-dist}_k(x, y) \right)^{-1}$$

- The LOF score is:

$$\text{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\text{lrd}_k(y)}{\text{lrd}_k(x)}$$
LOF values significantly greater than 1 indicate that the point is in a region of lower density than its neighbors, marking it as a local outlier.
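The "local" part is what distinguishes LOF: the same offset can be anomalous next to a dense cluster yet unremarkable inside a sparse one. A sketch under assumed cluster parameters (two synthetic clusters of very different density):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
dense = rng.normal(0, 0.1, size=(100, 2))     # tight cluster around the origin
sparse = rng.normal(10, 2.0, size=(100, 2))   # diffuse cluster around (10, 10)
X = np.vstack([dense, sparse, [[0.0, 1.0]]])  # last point sits just outside the dense cluster

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)

# negative_outlier_factor_ holds -LOF; values far below -1 indicate outliers.
print(lof.negative_outlier_factor_[-1])
```

An offset of 1.0 inside the sparse cluster would be well within its typical neighbor distances; the point is flagged only because its neighborhood happens to be dense.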
Elliptic Envelope assumes the data follows a multivariate Gaussian distribution and fits a robust covariance estimate using the Minimum Covariance Determinant (MCD) estimator. The Mahalanobis distance under the robust estimates $(\hat{\mu}, \hat{\Sigma})$ is then used to identify outliers:

$$d_M(x) = \sqrt{(x - \hat{\mu})^\top \hat{\Sigma}^{-1} (x - \hat{\mu})}$$

Points whose Mahalanobis distance exceeds a threshold (based on the $\chi^2$ distribution with degrees of freedom equal to the number of features) are flagged as outliers.
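A sketch of the distance-plus-threshold view (the 0.975 quantile is an illustrative cutoff; `EllipticEnvelope.predict` itself thresholds via the `contamination` parameter instead):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# 500 samples from a correlated 2-D Gaussian
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)

ee = EllipticEnvelope(random_state=0).fit(X)

# mahalanobis() returns *squared* Mahalanobis distances under the MCD estimate
d2 = ee.mahalanobis(np.array([[0.0, 0.0], [5.0, -5.0]]))

# Compare against a chi-squared cutoff with df = number of features
threshold = chi2.ppf(0.975, df=2)
print(d2 > threshold)
```

Because the squared Mahalanobis distance of Gaussian data follows a $\chi^2_p$ distribution, the quantile cutoff has a direct probabilistic interpretation, which is harder to obtain for the tree- and density-based methods above.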