Principle:Scikit learn Scikit learn Pipeline Execution
Overview
A sequential computation that fits all pipeline steps on training data and propagates transformed data through the chain to produce predictions.
Description
Once a pipeline has been constructed by chaining transformers and a final estimator, it must be executed -- that is, fitted on training data and used to make predictions on new data. Pipeline execution defines how data flows through the chain of steps during both the training (fit) phase and the inference (predict) phase.
The fit-transform cascade:
During fit, data flows through the pipeline in a sequential cascade:
- The first transformer receives the raw input X and target y. It calls fit_transform(X, y) (for most transformers, equivalent to fit(X, y) followed by transform(X)), learning its parameters from the data and producing a transformed output X1.
- The second transformer receives X1 and y. It calls fit_transform(X1, y), producing X2.
- This continues for all intermediate transformers.
- The final estimator receives the fully transformed data Xn and calls fit(Xn, y) to learn the model parameters.
This cascade ensures that each transformer sees data in exactly the form it expects -- the output of all preceding transformations -- and that the training labels y are available at every step for supervised transformers (e.g., target encoders).
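The cascade described above can be sketched by fitting the steps by hand and comparing the result with a Pipeline built from the same steps. This is a minimal illustration, not the library's internal code: the synthetic dataset, the step names ("scale", "pca", "clf"), and the choice of StandardScaler, PCA, and LogisticRegression are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; sizes and seeds are arbitrary choices for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Manual cascade: each step is fitted on the output of the previous one.
scaler = StandardScaler()
X1 = scaler.fit_transform(X, y)      # first transformer sees raw X
pca = PCA(n_components=3, random_state=0)
X2 = pca.fit_transform(X1, y)        # second transformer sees X1
clf = LogisticRegression()
clf.fit(X2, y)                       # final estimator sees X2

# The same chain expressed as a Pipeline (step names are illustrative).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3, random_state=0)),
    ("clf", LogisticRegression()),
])
pipeline.fit(X, y)

# Both routes should learn the same parameters.
same_scaler = np.allclose(scaler.mean_, pipeline.named_steps["scale"].mean_)
same_model = np.allclose(clf.coef_, pipeline.named_steps["clf"].coef_)
print(same_scaler, same_model)
```

If both flags are true, the pipeline has learned the same scaler statistics and classifier coefficients as the hand-written cascade.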
The predict cascade:
During predict, data flows through a simpler cascade:
- Each intermediate transformer calls only transform(X) (not fit) using the parameters learned during the fit phase.
- The final estimator calls predict(Xn) on the fully transformed data.
This separation between fit and predict ensures that no training-time computation is repeated during inference, and that the model's learned parameters are applied consistently.
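As a rough sketch of the predict cascade, the loop below walks a fitted pipeline's steps by hand, calling transform on each intermediate step and predict on the last, and then checks two things: that the result matches pipeline.predict, and that the scaler's learned mean is unchanged by prediction. The dataset and step names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; sizes and seeds are arbitrary choices for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X_train, y_train)

# Manual predict cascade: transform only, never fit, on new data.
Xt = X_test
for name, step in pipeline.steps[:-1]:
    Xt = step.transform(Xt)
manual_pred = pipeline.steps[-1][1].predict(Xt)

# Record a learned parameter, predict, and confirm it did not change.
mean_before = pipeline.named_steps["scale"].mean_.copy()
pipeline_pred = pipeline.predict(X_test)
mean_unchanged = np.array_equal(mean_before, pipeline.named_steps["scale"].mean_)

predictions_match = np.array_equal(manual_pred, pipeline_pred)
print(predictions_match, mean_unchanged)
```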
Data flow integrity:
The pipeline checks that it has been fitted before allowing predict to be called (via check_is_fitted, which raises NotFittedError). This guards against the common error of predicting with an unfitted pipeline, which would otherwise produce meaningless results or cryptic errors from individual steps.
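A small sketch of what this looks like from the caller's side: calling predict on a pipeline that was never fitted and catching the resulting NotFittedError. The step names and toy input are assumptions, and the exact error message (and whether the error originates in the pipeline itself or in its first step) may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# An unfitted pipeline (step names are illustrative).
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
X_new = np.array([[0.0, 1.0], [1.0, 0.0]])

try:
    pipeline.predict(X_new)
    raised_not_fitted = False
except NotFittedError as exc:
    # The message text is version-dependent, so it is only printed here.
    raised_not_fitted = True
    print(f"Refused to predict: {exc}")
```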
Usage
Pipeline execution is invoked every time a pipeline is trained or used for prediction:
- pipeline.fit(X_train, y_train) -- Runs the full fit-transform cascade on training data
- pipeline.predict(X_test) -- Runs the transform-predict cascade on new data
- pipeline.fit_transform(X_train, y_train) -- Fits the pipeline and returns the transformed training data (requires the final step to be a transformer)
- cross_val_score(pipeline, X, y) -- Internally clones the pipeline, then fits and scores it on each fold
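The calls listed above might be used together roughly as follows. The synthetic data, the step names, and the second pipeline ("preprocess", which ends in a transformer so that fit_transform applies) are assumptions introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; sizes and seeds are arbitrary choices for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train and predict with a pipeline that ends in a classifier.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# fit_transform applies to a pipeline whose final step is a transformer.
preprocess = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
X_train_2d = preprocess.fit_transform(X_train, y_train)

# Cross-validation refits the whole pipeline on each training fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print(y_pred.shape, X_train_2d.shape, scores.shape)
```

Passing the whole pipeline to cross_val_score, rather than pre-scaling X, means the scaler is refitted inside each fold and never sees the held-out data.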
Theoretical Basis
Pipeline execution implements the mathematical concept of function composition with state. Unlike pure function composition where g(f(x)) is stateless, pipeline execution has two phases:
Fitting (parameter estimation):
Each step f_i has parameters theta_i that are estimated from data:
theta_i^* = argmin_{theta_i} L_i(f_i(X_{i-1}; theta_i))
where X_{i-1} is the output of the previous step (with X_0 = X, the raw input) and L_i is the step's objective function (which may be implicit, as in scaling where the parameters are simply the sample mean and variance).
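For the scaling case mentioned above, the "fitted parameters" can be inspected directly. The sketch below assumes StandardScaler as the scaling step and random toy data; it checks that the learned mean_ and var_ attributes agree with the per-column sample mean and (population, ddof=0) variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data; the distribution and shape are arbitrary choices.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Fitting estimates the step's parameters from the data.
scaler = StandardScaler().fit(X)

# The fitted parameters are the per-column statistics of X.
mean_matches = np.allclose(scaler.mean_, X.mean(axis=0))
var_matches = np.allclose(scaler.var_, X.var(axis=0))  # NumPy default ddof=0
print(mean_matches, var_matches)
```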
Prediction (parameter application):
Once fitted, each step applies its learned parameters deterministically:
X_i = f_i(X_{i-1}; theta_i^*)
where theta_i^* are the fitted parameters. The final prediction is:
y_pred = g(f_n(...f_2(f_1(X; theta_1^*); theta_2^*)...; theta_n^*); theta_g^*)
This two-phase execution model is the foundation of the fit/predict paradigm that underlies all supervised learning in scikit-learn.
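The composed prediction formula can be spelled out for a simple two-step case. The sketch below assumes a StandardScaler followed by LinearRegression (both chosen only for illustration, as are the step names and data) and rebuilds the prediction from the fitted parameters alone, without calling transform or predict on the individual steps.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; sizes and seeds are arbitrary choices.
X, y = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
pipeline.fit(X, y)

scaler = pipeline.named_steps["scale"]
reg = pipeline.named_steps["reg"]

# f_1(X; theta_1^*): standardize with the fitted mean and scale.
X1 = (X - scaler.mean_) / scaler.scale_
# g(X1; theta_g^*): apply the fitted linear model.
y_manual = X1 @ reg.coef_ + reg.intercept_

composition_matches = np.allclose(y_manual, pipeline.predict(X))
print(composition_matches)
```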