
Heuristic:Scikit-learn Data Leakage Prevention

From Leeroopedia




Knowledge Sources
Domains Machine_Learning, Best_Practices
Last Updated 2026-02-08 15:00 GMT

Overview

Critical best practice: always split train/test data before any preprocessing, and use Pipeline to prevent data leakage during model evaluation.

Description

Data leakage occurs when information from outside the training set is used during model training or preprocessing. The most common form is fitting transformers (e.g., StandardScaler, feature selectors) on the full dataset before splitting into train/test, which leaks test-set statistics into the training process. This leads to overly optimistic performance estimates that do not generalize. The scikit-learn Pipeline class exists specifically to prevent this by ensuring transformations are fit only on training data during cross-validation.
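The anti-pattern and its fix can be sketched side by side. The toy data and variable names below are illustrative, not taken from the scikit-learn docs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = rng.randint(0, 2, size=100)

# ANTI-PATTERN: the scaler sees every row, so its mean and std
# incorporate test-set statistics before the split even happens.
X_leaky = StandardScaler().fit_transform(X)
X_tr_leaky, X_te_leaky, *_ = train_test_split(X_leaky, y, random_state=0)

# CORRECT: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test data is transformed, never fitted on
```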

Usage

Apply this heuristic whenever you preprocess data before model evaluation. It is especially critical when using StandardScaler_Init (scaling), ColumnTransformer_Init (column transformations), Cross_Validate (cross-validation evaluation), and GridSearchCV_Init (hyperparameter search). If you scale or select features before calling `train_test_split` or `cross_validate`, your results are invalid.
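Wrapping the transformer and estimator in a Pipeline lets `cross_validate` re-fit the scaler on each training fold automatically. A minimal sketch on synthetic data (the step names "scale" and "clf" are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The Pipeline re-fits the scaler on each training fold, so no fold's
# held-out statistics ever reach the estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_validate(pipe, X, y, cv=5)
mean_score = scores["test_score"].mean()
```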

The Insight (Rule of Thumb)

  • Action: Always split data into train/test before any preprocessing. Use `Pipeline` to chain transformers and estimators.
  • Value: Call `fit_transform` only on training data; call `transform` only on test data.
  • Trade-off: Pipelines add slight code complexity but eliminate the most common source of evaluation bias.
  • Anti-pattern: Calling `scaler.fit_transform(X)` on the full dataset, then splitting into train/test.
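The same rule extends to hyperparameter search: with the transformer inside the pipeline, `GridSearchCV` re-fits scaling within every internal fold, and the final held-out score uses `transform` only. A sketch, assuming synthetic data and an arbitrary SVC parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
# Parameters of pipeline steps are addressed as <step>__<param>.
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)             # scaler re-fit inside each CV fold
test_acc = grid.score(X_test, y_test)  # held-out data is only transformed
```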

Reasoning

When you fit a StandardScaler on all data, the mean and standard deviation incorporate test-set information. This means the model has indirect access to test data during training, producing evaluation scores that are higher than real-world performance. The same applies to feature selection: if you select features based on the full dataset, the test set influenced which features were chosen. The Pipeline class solves this by running `fit_transform` on training folds and `transform` on test folds automatically during cross-validation.
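The feature-selection case can be sketched the same way: putting `SelectKBest` inside the Pipeline means each cross-validation round chooses features from its training folds only. The dataset shape and `k=10` below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Many noise features: selecting on the full dataset would let the
# test folds influence which columns survive.
X, y = make_classification(n_samples=150, n_features=50, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),  # scored on training folds only
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
```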

From `doc/common_pitfalls.rst`:

"Scikit-learn provides Pipeline which will under the right circumstances not leak the test data into the train data."

"Using the Pipeline for this purpose ensures that all the transformations are applied on the correct data subset."

Code Evidence

The Pipeline's `fit` chains each intermediate step's `fit_transform` before fitting the final estimator, from `sklearn/pipeline.py:567-607`:

def fit(self, X, y=None, **params):
    routed_params = self._check_method_params(method="fit", props=params)
    Xt = self._fit(X, y, routed_params)
    with config_context(
        skip_parameter_validation=(
            prefer_skip_nested_validation or self._skip_parameter_validation
        )
    ):
        if self._final_estimator != "passthrough":
            fit_params_last_step = routed_params[self.steps[-1][0]]
            self._final_estimator.fit(Xt, y, **fit_params_last_step["fit"])
    return self

Cross-validation constructs the fold splitter before any fitting, so a Pipeline passed as the estimator is fit per training fold, from `sklearn/model_selection/_validation.py:176-178`:

cv = check_cv(cv, y, classifier=is_classifier(estimator))
# For classifiers, StratifiedKFold is used by default

Optional dependency check pattern from `sklearn/utils/_optional_dependencies.py:5-22`:

def check_matplotlib_support(caller_name):
    try:
        import matplotlib  # noqa: F401
    except ImportError as e:
        raise ImportError(
            "{} requires matplotlib. You can install matplotlib with "
            "`pip install matplotlib`".format(caller_name)
        ) from e

Related Pages

Page Connections

  • Principle
  • Implementation
  • Heuristic
  • Environment