Principle:SeldonIO Seldon core Drift And Outlier Detection Training
| Property | Value |
|---|---|
| Principle Name | Drift And Outlier Detection Training |
| Overview | Statistical methods for training drift detectors and outlier detectors to monitor production ML model inputs |
| Domains | MLOps, Statistical_Testing, Anomaly_Detection |
| Related Implementation | SeldonIO_Seldon_core_Alibi_Detect_Training |
| Knowledge Sources | Paper (alibi-detect: https://arxiv.org/abs/2311.01096), Doc (https://docs.seldon.io/projects/alibi-detect) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Production ML systems require monitoring for data drift (distribution shift between training and production data) and outliers (anomalous inputs). The alibi-detect library provides two key detector types for this purpose:
- TabularDrift for drift testing on mixed-type tabular data, applying univariate chi-squared and Kolmogorov-Smirnov tests to each feature
- OutlierVAE for reconstruction-based outlier detection using variational autoencoders
These detectors are trained on reference data from the training distribution and then deployed alongside the production classifier to continuously monitor incoming data quality.
Theoretical Basis
Drift Detection
Drift detection uses statistical hypothesis testing. TabularDrift applies per-feature tests against a reference distribution:
- Chi-squared tests for categorical features
- Kolmogorov-Smirnov tests for continuous features
- Bonferroni correction for multiple testing across features
For each feature, the null hypothesis is that the reference and test distributions are equal. When any feature's Bonferroni-corrected p-value falls below the significance threshold, the null hypothesis is rejected and drift is declared.
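As an illustration of the categorical case, the following is a minimal sketch of the Pearson chi-squared statistic for a single feature, computed from a 2 x K contingency table of reference versus test counts. The function name `chi2_statistic` and the toy data are hypothetical; this is not alibi-detect's implementation, and it returns only the statistic and degrees of freedom, not a p-value.

```python
from collections import Counter


def chi2_statistic(ref, test):
    """Pearson chi-squared statistic for a 2 x K table (reference vs. test).

    Hypothetical helper for illustration; assumes both samples are non-empty.
    Returns (statistic, degrees_of_freedom).
    """
    categories = sorted(set(ref) | set(test))
    ref_counts, test_counts = Counter(ref), Counter(test)
    n_ref, n_test = len(ref), len(test)
    total = n_ref + n_test

    stat = 0.0
    for c in categories:
        col_total = ref_counts[c] + test_counts[c]
        for observed, row_total in ((ref_counts[c], n_ref), (test_counts[c], n_test)):
            # Expected count if both samples shared one category distribution.
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(categories) - 1


# Toy data: the reference is balanced, the test sample is heavily skewed.
reference = ["a"] * 50 + ["b"] * 50
shifted = ["a"] * 90 + ["b"] * 10

stat_same, dof_same = chi2_statistic(reference, reference)
stat_shift, dof_shift = chi2_statistic(reference, shifted)
```

With one degree of freedom, the 5% critical value of the chi-squared distribution is about 3.841, so the skewed sample would be flagged while the identical sample would not.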
Outlier Detection
OutlierVAE trains a Variational Autoencoder (VAE) to reconstruct normal data. The encoder compresses input features into a low-dimensional latent representation, and the decoder reconstructs the input from it. Outliers produce a high reconstruction error (MSE), so an instance is flagged when its error exceeds a threshold calibrated on reference data.
The VAE learns the manifold of normal data during training. At inference time, inputs that lie far from this manifold cannot be faithfully reconstructed, resulting in elevated reconstruction error.
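The text does not specify how the outlier threshold is chosen. One option, assumed here purely for illustration, is to take a high percentile of the reconstruction errors measured on reference data. The helper name `percentile_threshold` and the scores below are hypothetical.

```python
def percentile_threshold(scores, perc=95.0):
    """Return the perc-th percentile of scores using linear interpolation.

    Hypothetical helper; assumes a non-empty list of reconstruction errors.
    """
    ordered = sorted(scores)
    rank = (perc / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Made-up reconstruction errors for 101 reference instances: 0.0, 1.0, ..., 100.0
reference_scores = [float(i) for i in range(101)]
threshold = percentile_threshold(reference_scores, perc=95.0)

# A new instance is flagged when its score exceeds the calibrated threshold.
flag_high = 200.0 > threshold
flag_low = 50.0 > threshold
```

A higher percentile makes the detector more conservative, flagging fewer reference-like instances at the cost of missing milder anomalies.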
Mathematical Formulation
TabularDrift
H0: P_ref = P_test
p_val threshold = 0.05
Per-feature test:
- Categorical: chi-squared test statistic
- Continuous: KS test statistic D = sup|F_ref(x) - F_test(x)|
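The KS statistic above can be sketched directly from its definition: build the two empirical CDFs and take the largest absolute gap between them. This is a minimal standard-library illustration with a hypothetical function name `ks_statistic`; it computes only D, not the associated p-value.

```python
from bisect import bisect_right


def ks_statistic(ref, test):
    """Two-sample KS statistic D = sup |F_ref(x) - F_test(x)|.

    Hypothetical helper; assumes both samples are non-empty. Both empirical
    CDFs are step functions, so the supremum is reached at a sample point.
    """
    ref_sorted, test_sorted = sorted(ref), sorted(test)
    n_ref, n_test = len(ref_sorted), len(test_sorted)

    d = 0.0
    for x in ref_sorted + test_sorted:
        f_ref = bisect_right(ref_sorted, x) / n_ref     # share of ref values <= x
        f_test = bisect_right(test_sorted, x) / n_test  # share of test values <= x
        d = max(d, abs(f_ref - f_test))
    return d


d_same = ks_statistic([1, 2, 3], [1, 2, 3])          # identical samples
d_disjoint = ks_statistic([1, 2, 3], [10, 11, 12])   # no overlap at all
d_partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])  # half the mass shifted
```

D ranges from 0 (identical empirical distributions) to 1 (completely separated samples).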
Multiple testing correction:
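- Bonferroni: adjusted p_val_i = min(1, p_val_i * n_features) for each feature i; drift is flagged if any adjusted p_val_i falls below the threshold (equivalently, if any raw p_val_i falls below threshold / n_features)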
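A small worked example of the Bonferroni step, assuming per-feature p-values have already been computed. The function name `bonferroni_drift` and the p-values are made up for illustration.

```python
def bonferroni_drift(p_values, p_val=0.05):
    """Apply Bonferroni correction and decide whether drift is flagged.

    Hypothetical helper. Returns (is_drift, adjusted_p_values).
    """
    n_features = len(p_values)
    # Scale each raw p-value by the number of tests, capped at 1.
    adjusted = [min(1.0, p * n_features) for p in p_values]
    # Drift is declared if any feature remains significant after correction.
    is_drift = any(p < p_val for p in adjusted)
    return is_drift, adjusted


# Three features: 0.04 looks significant on its own, but not after correction.
no_drift, adj_a = bonferroni_drift([0.04, 0.5, 0.9])
# Here 0.01 * 3 = 0.03 is still below 0.05, so drift is flagged.
drift, adj_b = bonferroni_drift([0.01, 0.5, 0.9])
```

The first case shows why the correction matters: testing many features independently at 0.05 would otherwise inflate the chance of a false drift alarm.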
OutlierVAE
outlier_score = MSE(x, VAE(x))
is_outlier = (outlier_score > threshold)
Where:
VAE(x) = decoder(z), z ~ q(z|x)
q(z|x) = encoder output (approximate posterior)
MSE = (1/d) * sum((x_i - x_hat_i)^2)
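The scoring rule above can be sketched as follows. The reconstruction `x_hat` would come from the VAE's decoder; here it is a hand-written stand-in, and the function names and threshold value are hypothetical.

```python
def reconstruction_mse(x, x_hat):
    """MSE = (1/d) * sum((x_i - x_hat_i)^2) over the d features of one instance."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same number of features")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def is_outlier(x, x_hat, threshold):
    """Flag the instance when its reconstruction error exceeds the threshold."""
    return reconstruction_mse(x, x_hat) > threshold


# Stand-in reconstructions (a trained VAE would produce these).
score_good = reconstruction_mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect
score_bad = reconstruction_mse([0.0, 0.0], [1.0, 3.0])              # (1 + 9) / 2

flag_good = is_outlier([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], threshold=0.5)
flag_bad = is_outlier([0.0, 0.0], [1.0, 3.0], threshold=0.5)
```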
Usage
Use this principle when building a production monitoring pipeline that must detect distribution shift or anomalous inputs before they degrade model performance. The trained detectors are serialized and deployed as independent model components in a Seldon Core 2 pipeline.
The typical workflow is:
- Train the classifier on reference data
- Train TabularDrift using the same reference data as the baseline distribution
- Train OutlierVAE on preprocessed reference features to learn the normal reconstruction manifold
- Save all detectors using alibi-detect's save_detector utility
- Deploy detectors alongside the classifier in a monitoring pipeline
Related Pages
- SeldonIO_Seldon_core_Alibi_Detect_Training (implements this principle) - Concrete tools for training drift and outlier detectors using alibi-detect
- SeldonIO_Seldon_core_Monitoring_Component_Deployment (next step) - Deploying trained detectors as model components
- SeldonIO_Seldon_core_Monitoring_Pipeline_Definition (uses detectors) - Composing detectors into a unified monitoring pipeline
- SeldonIO_Seldon_core_Production_Traffic_Monitoring (end goal) - Sending production traffic through the monitoring pipeline