Implementation:Scikit learn Scikit learn Pipeline Fit Predict
Overview
Concrete scikit-learn tool for fitting and predicting with a pipeline of transformers followed by a final estimator.
Code Reference
Class: Pipeline
Module: sklearn/pipeline.py
Methods documented:
- Pipeline.fit (lines 567-624)
- Pipeline.predict (lines 698-753)
- Pipeline.fit_transform (lines 638-696)
Pipeline.fit
Signature:
def fit(self, X, y=None, **params):
Fit all the transformers one after the other and sequentially transform the data. Finally, fit the transformed data using the final estimator.
Pipeline.predict
Signature:
def predict(self, X, **params):
Transform the data through all intermediate steps, and apply predict with the final estimator. Only valid if the final estimator implements predict.
Pipeline.fit_transform
Signature:
def fit_transform(self, X, y=None, **params):
Fit the model and transform with the final estimator. Only valid if the final estimator implements fit_transform or both fit and transform.
I/O Contract
fit parameters:
- X: iterable -- Training data. Must fulfill input requirements of the first step of the pipeline.
- y: iterable, default=None -- Training targets. Must fulfill label requirements for all steps of the pipeline.
- **params: dict of str -> object -- With metadata routing disabled (default): parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p. With metadata routing enabled: parameters are routed to steps that have requested them.
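The s__p prefixing rule can be illustrated with a small sketch. It assumes metadata routing is left at its default (disabled); the step names "scaler" and "svc" and the per-sample weights are made up for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)

# Hypothetical per-sample weights, chosen only to have something to pass.
sample_weight = np.where(y == 1, 2.0, 1.0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Key "svc__sample_weight" means: pass sample_weight=... to the fit
# method of the step named "svc".
pipe.fit(X, y, svc__sample_weight=sample_weight)

# The same thing done by hand, for comparison.
scaler = StandardScaler().fit(X)
svc = SVC().fit(scaler.transform(X), y, sample_weight=sample_weight)

same_predictions = np.array_equal(
    pipe.predict(X), svc.predict(scaler.transform(X))
)
```

If the prefix does not match any step name, the parameter cannot be assigned to a step, so the step names in the keys have to match those used when constructing the pipeline.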
fit returns:
- self: Pipeline -- The fitted pipeline object.
predict parameters:
- X: iterable -- Data to predict on. Must fulfill input requirements of the first step of the pipeline.
- **params: dict of str -> object -- With metadata routing disabled: parameters to the predict method of the final estimator. With metadata routing enabled: parameters are routed to appropriate steps.
predict returns:
- y_pred: ndarray -- Result of calling predict on the final estimator.
fit_transform returns:
- Xt: ndarray of shape (n_samples, n_transformed_features) -- Transformed samples.
Implementation Details
fit method
The fit method executes the fit-transform cascade:
- Validates routing parameters via _check_method_params.
- Calls the internal _fit method, which iterates through all intermediate steps (all steps except the last). For each intermediate step, it calls fit_transform (if available) or fit followed by transform, passing the output as input to the next step.
- Fits the final estimator on the fully transformed data by calling self._final_estimator.fit(Xt, y, **last_step_params["fit"]).
- If the final estimator is the string "passthrough", the final fit step is skipped.
- Returns self.
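The fit cascade described above can be approximated by hand. This is a simplified sketch, not the library's code: it ignores caching, "passthrough" steps, and parameter routing, and the three step names and the PCA settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
steps = [
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5, svd_solver="full")),
    ("clf", LogisticRegression()),
]

# Manual cascade: fit_transform every intermediate step in order, feeding
# each output to the next step. Clones are used so the estimators that the
# pipeline fits below are not the same objects.
Xt = X
for name, transformer in steps[:-1]:
    Xt = clone(transformer).fit_transform(Xt, y)

# The final estimator is fitted on the fully transformed data.
manual_clf = clone(steps[-1][1]).fit(Xt, y)

# The pipeline performs the same sequence inside a single fit call.
pipe = Pipeline(steps).fit(X, y)
coefs_match = np.allclose(pipe.named_steps["clf"].coef_, manual_clf.coef_)
```

Both routes should produce the same coefficients, since the transformers and the classifier used here are deterministic.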
predict method
The predict method executes the transform-predict cascade:
- Calls check_is_fitted(self) to verify that the pipeline has been fitted.
- Iterates through all intermediate steps (excluding the final estimator), calling transform.transform(Xt) on each to sequentially transform the input data. Here transform is the loop variable bound to the current step's transformer.
- Calls self.steps[-1][1].predict(Xt, **params) on the final estimator.
- When metadata routing is enabled, parameters are routed to the appropriate steps' transform methods and to the final estimator's predict method.
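The transform-predict cascade can be reproduced on an already fitted pipeline by walking its steps attribute. This is a minimal sketch that assumes no "passthrough" steps and no metadata routing; the loop is an illustration, not the library's own implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipe.fit(X_train, y_train)

# Apply each fitted intermediate transformer in order; only transform is
# called, so nothing is re-learned from the test data.
Xt = X_test
for name, transformer in pipe.steps[:-1]:
    Xt = transformer.transform(Xt)

# The last step is the final estimator; call predict on the transformed data.
manual_pred = pipe.steps[-1][1].predict(Xt)

matches = np.array_equal(manual_pred, pipe.predict(X_test))
```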
fit_transform method
The fit_transform method combines fitting and transforming:
- Runs the same _fit cascade as fit for all intermediate steps.
- For the final step, calls fit_transform if available, otherwise calls fit followed by transform.
- If the final estimator is "passthrough", returns the already-transformed data directly.
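Both branches of fit_transform can be seen in a short sketch: one pipeline ends in a transformer, the other ends in "passthrough". The step names and the choice of PCA with two components are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)  # 100 samples, 20 features

# Final step is a transformer: fit_transform returns its output.
reduce_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, svd_solver="full")),
])
X_reduced = reduce_pipe.fit_transform(X)

# Final step is "passthrough": the output of the intermediate steps is
# returned unchanged, here the scaled data.
scale_only = Pipeline([("scaler", StandardScaler()), ("final", "passthrough")])
X_scaled = scale_only.fit_transform(X)

passthrough_matches = np.allclose(X_scaled, StandardScaler().fit_transform(X))
```

A pipeline whose final step is a predictor without a transform method, such as SVC, does not support fit_transform, which is the validity condition stated in the method description.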
Usage Examples
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# fit: runs StandardScaler.fit_transform on X_train, then SVC.fit on scaled data
pipe.fit(X_train, y_train)
# predict: runs StandardScaler.transform on X_test, then SVC.predict on scaled data
y_pred = pipe.predict(X_test)
# score: internally calls predict and compares with y_test
accuracy = pipe.score(X_test, y_test) # 0.88
Passing step-specific parameters:
# Set parameters of the SVC step using the '__' separator
pipe.set_params(svc__C=10)
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test) # 0.76
Full preprocessing pipeline with fit and predict:
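import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Illustrative mixed-type data (made up for this example). make_column_selector
# only works on pandas DataFrames, so the NumPy arrays from the earlier examples
# cannot be reused here.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, 50, 46],
    "income": [40.0, 52.0, 61.0, np.nan, 58.0, 45.0, 80.0, 75.0],
    "city": pd.Series(["A", "B", "A", np.nan, "B", "A", "B", "A"], dtype=object),
})
target = np.array([0, 0, 1, 1, 1, 0, 1, 1])
X_train, X_test, y_train, y_test = train_test_split(df, target, random_state=0)
# Build preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        # handle_unknown="ignore" keeps predict from failing on categories unseen during fit
        ("cat", make_pipeline(SimpleImputer(strategy="constant"), OneHotEncoder(handle_unknown="ignore")),
         make_column_selector(dtype_include=object)),
    ]
)
# Build full pipeline
clf = make_pipeline(preprocessor, LogisticRegression())
# Execute: fit learns all parameters, predict applies them
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)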