Principle:Scikit learn Scikit learn Pipeline Execution

Overview

A sequential computation that fits all pipeline steps on training data and then passes new data through the fitted steps to produce predictions.

Description

Once a pipeline has been constructed by chaining transformers and a final estimator, it must be executed -- that is, fitted on training data and used to make predictions on new data. Pipeline execution defines how data flows through the chain of steps during both the training (fit) phase and the inference (predict) phase.

The fit-transform cascade:

During fit, data flows through the pipeline in a sequential cascade:

The first transformer receives the raw input X and target y. The pipeline calls its fit_transform(X, y), falling back to fit(X, y) followed by transform(X) when the transformer does not define fit_transform. The transformer learns its parameters from the data and produces a transformed output X1.
The second transformer receives X1 and y. The pipeline calls its fit_transform(X1, y), producing X2.
This continues for all intermediate transformers.
The final estimator receives the fully transformed data Xn and calls fit(Xn, y) to learn the model parameters.

This cascade ensures that each transformer sees data in exactly the form it expects -- the output of all preceding transformations -- and that the training labels y are available at every step for supervised transformers (e.g., target encoders).
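
The cascade above can be sketched by hand and compared against Pipeline.fit. This is a minimal illustration, not the library's internal code: the synthetic dataset, the step names ("scale", "select", "clf"), and the choice of SelectKBest as a supervised transformer are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Manual cascade: each transformer is fitted on the output of the previous one.
scaler = StandardScaler()
X1 = scaler.fit_transform(X, y)        # y is accepted but ignored by the scaler
selector = SelectKBest(f_classif, k=3)
X2 = selector.fit_transform(X1, y)     # supervised transformer: y is used here
clf = LogisticRegression()
clf.fit(X2, y)                         # final estimator sees fully transformed data

# The same steps expressed as a Pipeline.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=3)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Both routes should learn the same model parameters.
same_coef = np.allclose(pipe.named_steps["clf"].coef_, clf.coef_)
print(same_coef)
```

The comparison of coefficients is only a sanity check that the hand-written chain and the pipeline perform the same sequence of fits.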

The predict cascade:

During predict, data flows through a simpler cascade:

The pipeline calls only transform(X) on each intermediate transformer (never fit), reusing the parameters learned during the fit phase.
The final estimator calls predict(Xn) on the fully transformed data.

This separation between fit and predict ensures that nothing is re-estimated at inference time: new data is transformed with exactly the parameters learned from the training data, so the model's learned parameters are applied consistently.
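
As a hedged sketch of the predict cascade, the fitted steps can be applied by hand and compared with pipeline.predict. The dataset, the train/test split, and the step names "scale" and "clf" are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# Manual predict cascade: transform only, reusing the fitted parameters.
X_test_scaled = pipe.named_steps["scale"].transform(X_test)
manual_pred = pipe.named_steps["clf"].predict(X_test_scaled)

# The pipeline performs the same transform-then-predict sequence.
pipeline_pred = pipe.predict(X_test)
predictions_match = np.array_equal(manual_pred, pipeline_pred)

# The scaler's statistics come from the training split, not the test split.
uses_train_stats = np.allclose(
    pipe.named_steps["scale"].mean_, X_train.mean(axis=0)
)
print(predictions_match, uses_train_stats)
```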

Data flow integrity:

A pipeline must be fitted before predict is called. scikit-learn validates this with check_is_fitted; depending on the scikit-learn version, the check is triggered by the pipeline itself or by the first step that needs fitted parameters. Either way, predicting with an unfitted pipeline fails with a NotFittedError rather than silently producing meaningless results.
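
A small sketch of what happens when predict is called too early. Which component raises the error, and whether a warning is emitted first, may differ between scikit-learn versions, so the example only relies on NotFittedError being raised and suppresses warnings; the input array is a placeholder.

```python
import warnings

import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline that has been constructed but never fitted.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
X_new = np.zeros((2, 3))  # placeholder input, illustration only

raised = False
with warnings.catch_warnings():
    # Some versions warn before failing; the warning is not the point here.
    warnings.simplefilter("ignore")
    try:
        pipe.predict(X_new)
    except NotFittedError:
        raised = True

print(raised)
```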

Usage

Pipeline execution is invoked every time a pipeline is trained or used for prediction:

pipeline.fit(X_train, y_train) -- Runs the full fit-transform cascade on training data
pipeline.predict(X_test) -- Runs the transform-predict cascade on new data
pipeline.fit_transform(X_train, y_train) -- Fits the pipeline and returns the transformed training data (available only when the final step is a transformer)
cross_val_score(pipeline, X, y) -- For each fold, fits a clone of the pipeline on the training portion and scores it on the held-out portion
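
The calls listed above might look as follows in practice. This is a sketch under assumptions: the data is synthetic, and the second pipeline (named preprocess here) is a hypothetical transformer-only pipeline added to show fit_transform.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline ending in a classifier: supports fit and predict.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Pipeline ending in a transformer: supports fit_transform.
preprocess = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
X_train_2d = preprocess.fit_transform(X_train, y_train)

# Cross-validation refits the whole pipeline on each training fold,
# so preprocessing statistics never come from the held-out fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print(y_pred.shape, X_train_2d.shape, scores.shape)
```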

Theoretical Basis

Pipeline execution implements the mathematical concept of function composition with state. Unlike pure function composition where g(f(x)) is stateless, pipeline execution has two phases:

Fitting (parameter estimation):

Each step f_i has parameters theta_i, and fitting estimates their values theta_i^* from data:

theta_i^* = argmin_{theta_i} L_i(f_i(X_{i-1}; theta_i))

where X_{i-1} is the output of the previous step and L_i is the step's objective function (which may be implicit, as in scaling where the parameters are simply the sample mean and variance).
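
For the scaling case mentioned above, the fitted parameters can be inspected directly. A minimal check, assuming StandardScaler as the scaling step and randomly generated data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

scaler = StandardScaler().fit(X)

# The "fitted parameters" are just per-column sample statistics.
mean_matches = np.allclose(scaler.mean_, X.mean(axis=0))
# StandardScaler stores the population variance (ddof=0), as np.var does by default.
var_matches = np.allclose(scaler.var_, X.var(axis=0))
print(mean_matches, var_matches)
```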

Prediction (parameter application):

Once fitted, each step applies its learned parameters deterministically:

X_i = f_i(X_{i-1}; theta_i^*)

where theta_i^* are the fitted parameters. The final prediction is:

y_pred = g(f_n(...f_2(f_1(X; theta_1^*); theta_2^*)...; theta_n^*); theta_g^*)

This two-phase execution model is the foundation of the fit/predict paradigm that underlies all supervised learning in scikit-learn.
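
The idea of function composition with state can be sketched without scikit-learn. The classes Center and Scale and the helpers fit_chain and apply_chain below are hypothetical names invented for this illustration; they are not part of any library and omit the final estimator g for brevity.

```python
import numpy as np


class Center:
    """Hypothetical step whose learned parameter is the column mean."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)  # theta^* estimated from data
        return self

    def apply(self, X):
        return X - self.mean_        # deterministic use of theta^*


class Scale:
    """Hypothetical step whose learned parameter is the column std."""

    def fit(self, X):
        self.scale_ = X.std(axis=0)
        return self

    def apply(self, X):
        return X / self.scale_


def fit_chain(steps, X):
    """Fitting phase: each step is fitted on the previous step's output."""
    for step in steps:
        X = step.fit(X).apply(X)
    return steps


def apply_chain(steps, X):
    """Prediction phase: fitted parameters are applied, nothing is re-estimated."""
    for step in steps:
        X = step.apply(X)
    return X


X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
steps = fit_chain([Center(), Scale()], X_train)

train_out = apply_chain(steps, X_train)
new_out = apply_chain(steps, np.array([[3.0, 30.0]]))  # equals the training mean
print(train_out.mean(axis=0), train_out.std(axis=0), new_out)
```

The same fitted objects are reused in both calls to apply_chain, which is what distinguishes this from stateless composition.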
